Skip to content

Latest commit

 

History

History
1899 lines (1284 loc) · 147 KB

specification.adoc

File metadata and controls

1899 lines (1284 loc) · 147 KB

Awkward Array Specification (1.0-pre1)

Introduction

Array programming is a general programming paradigm, similar to functional programming or object-oriented programming in its scope. Pioneered by APL in 1962, this paradigm has appeared primarily in interactive data processing languages aimed at data analysts and statisticians: S (1976), MATLAB (1984), S-PLUS (1988), R (1993), and Numpy (2005). Numpy, unlike its predecessors, is not a language in itself, but a foundational library for science and statistics in Python.

Array programming is mostly used to solve problems with regular, rectangular grids. It is possible to represent and operate on more complex data structures, but it is awkward to do so in current frameworks. Fortunately, this awkwardness may be hidden in a suite of higher-level array classes that present modest extensions of the array programming paradigm to the user.

awkward-array (awkward on PyPI) is an implementation of complex data structures and operations in Python, depending only on Numpy. Drop-in replacements optimized with compiled code or leveraging GPUs are foreseen, though they would present the same user interface as the Numpy-only implementation. Extensions for using awkward-array as Pandas columns, in Numba-compiled functions, and in distributed calculations with Dask are also foreseen. These arrays can already be used to view Arrow data and can be persisted and reconstituted in any data format supporting Numpy arrays, such as HDF5, or named binary blobs, such as ZIP or key-value stores.

As an array programming environment for general data structures, awkward-array represents arbitrary data in a columnar form (compatible with and a generalization of Arrow). Most operations are vectorized or vectorizable because they are applied to flat, homogeneous arrays in memory masquerading as the nested and variable-length data structures that are presented to the user. This allows for memory and computational efficiency that are hard to match with implementations on traditional “row-wise” data, as well as automatic porting to pure vector coprocessors, such as GPUs. The key is a random access, composable representation of jagged arrays in terms of unstructured Numpy arrays.

This specification

Despite awkward-array’s focus on Python and Numpy, the data model it represents is language-agnostic. This document describes awkward-array as a protocol for generalizing array programming to complex structures in any environment. It is a normative document in the sense that it describes what awkward-array in Python should do; discrepancies between this document and the Python implementation would usually be decided in favor of the document (though the document itself may have “bugs” to be fixed in a later version). It is also normative in the sense that other implementations, in C++ or MATLAB for instance, that claim to “adhere to the awkward-array specification” should also behave as described here.

The goal of this document is to describe an array programming interface that extends a basic array programming interface. The extended interface is called “awkward-array” and the basic interface is a well-known starting point, such as Numpy. For simplicity, we will use Python syntax and Numpy function names, but a library in a different language or based on a different array library may still adhere to the awkward-array specification by providing a translation of these idioms.

It is not a goal of this document to explain how the awkward-array features are implemented. Drop-in replacements for the Numpy-only awkward-array library and libraries in other environments are free to implement the specification differently. They will thus have different performance characteristics. This specification cannot provide any performance hints.

Documentation is not an explicit goal of this specification, though users may read it to learn how to use any compliant implementation. Unlike documentation, however, this document describes what an awkward-array library should do, not necessarily what it does. Users are encouraged to file issues here if a discrepancy is found in the Numpy-only implementation.

The major and minor version numbers of this specification, xx.yy, correspond to the major and minor version numbers of the Numpy-only implementation: xx.yy in xx.yy.zz. The revision number, zz, increases as the implementation strives toward compliance with the specification. This particular version of the specification is a pre-release; compliance is only required for released specifications.

A library may implement more features than those described in this specification without affecting its compliance. If it implements fewer classes or fewer methods, it can only be said to be “partially compliant,” and would have to enumerate the missing classes or methods.

Basic array protocol

Necessary features of the underlying array library are described in this section for two reasons: to establish a syntax used in the rest of this document and as a basis for translating idioms of non-Numpy libraries into the language of this document. For instance, another language may use a function to extract slices of an array, rather than Python’s array[start:stop:step] syntax. If so, the awkward-array idiom in that language would use the same function name and arguments, rather than the Python ones.

The underlying array library must have the following features or some translation thereof.

It must be possible to represent ordered sequences of primitive values, known as arrays. An array is homogeneous: all values in an array have the same dtype, or primitive data type. Primitive types include integers, floating point numbers, and boolean (true/false) values. An awkward-array library inherits the primitives of the underlying library — if a base array library doesn’t support complex numbers (for instance), then neither does its awkward-array extension. Arrays need not be fixed-size, contiguous in memory, with only fixed-bytesize primitives, but these are the simplest cases.

It must be possible to construct N-dimensional arrays for any positive integer N. We will refer to an array’s size in all dimensions as its shape, a tuple of positive integers whose product is the total number of elements in the array. The length of an array is the array’s size in its first dimension.

t must be possible to extract an array element by integer index or a tuple of N integer indexes for an N-dimensional array. In this document, we will use 0-indexing: 0 extracts the first element, 1 extracts the second, etc. If an index is greater than or equal to N, it is out of bounds. An array library with 1-indexing (1 extracts the first element) would correspondingly have 1-indexing in its awkward-array extensions. We will use a number between square brackets after the array, as in myarray[5], or a comma-separated tuple like myarray[3, 1, 4], to represent extraction.

The array library may provide a mechanism for counting from the end of an array dimension. In Python, -1 refers to the last element, -2 to the second-to-last element, etc. We will use that convention. Given a negative index i and a dimension size d, the standard index is i + d. If the standard index is still negative, the index is out of bounds.

It must be possible to slice an array dimension to get a subsequence of that array along any dimension. A slice is defined by a start index (lower bound element number), a stop index (upper bound element number), and an optional step (stride length, for skipping a regular number of input elements between each returned element). In this document, we will use inclusive start values (the lower bound is included in the resulting subsequence) and exclusive stop values (the upper bound is not included). A step of 1 is equivalent to no step, and the step must not be 0, though negative values may be allowed (to reverse order). If either start or stop is not provided, they may be assumed to be 0 and the size of the dimension (respectively) if step is positive, or one less than the size of the dimension and one less than 0 (respectively) if step is negative. If the difference between step and start is not an integer multiple of step, we take that the subsequence to be truncated at the last element before stop. If either start or step are beyond the bounds of the array, we take them to be truncated to the nearest legal values, which may result in an empty subsequence, but not an error. If negative indexes are allowed for element extraction, they may be allowed for slicing as well. The Python syntax for this operation is myarray[start:stop:step] (in which any start, stop, or step may be missing or None, and the second colon may be omitted if there is no step). We will use this syntax in this document.

It must be possible to mask elements in an array dimension by a 1-dimensional boolean array of the same size as that array dimension. The result of such an operation is a sequence in the same order containing only the elements for which the matching boolean value is true. The Numpy syntax for this operation is to put the boolean mymask array in square brackets after the array: myarray[mymask], but it may be a named function call.

It must be possible to gather elements in an array dimension by a 1-dimensional integer array, using the integer array as extraction indexes. The result of such an operation, denoted myarray[myindexes], is a sequence with the same length and order as the indexing array myindexes, containing elements from myarray. The same rules apply to the elements of the indexing array as for single-element extraction. (In Numpy, this is sometimes called “fancy indexing,” though sometimes that term encompasses masking as well, so we will use “gather” in this document, as this is what the operation is called in SIMD programming.) As with masking, this may be a named function call.

It must either be possible to apply selections to multiple dimensions in a single call or to apply a selection to a specified dimension, not necessarily the first. For instance, we could extract from the first dimension, slice the second, mask the third, and gather the fourth in Numpy by separating requests with commas: myarray[5, start:stop:step, mymask, myindexes]. Selecting in multiple dimensions would allow selection in a specified dimension by passing all-inclusive slices to all dimensions before the dimension of interest: myarray[:, :, :, selection]. Selecting a specified dimension would allow selecting multiple dimensions by composition, so either is sufficient.

It must be possible to map arithmetic operations across all elements of one or more arrays. Any kernel function of n primitive type arguments returning a primitive type result can be applied to n equal-shape arrays and return a single new array of results with the same shape. The kernel function must be pure (no side effects), and many would be expressible as special syntax, such as + for addition, - for subtraction, etc. In Numpy, these are called “universal functions” or “ufuncs,” but this is such a specific protocol that we use a more general word, mapped kernels.

If any arguments in a mapped kernel have a scalar primitive type, rather than an array, they should be replaced by a constant array of the correct shape before mapping the kernel. If an argument has the correct dimensionality but some of its dimensions have size 1 where the other arguments have a size greater than 1, this dimension should be similarly expanded to a constant before mapping. These expansions do not need to be literal — the result is calculated as though the scalar or singleton dimension were a constant array. This conceptual expansion is known as broadcasting in Numpy and in this document.

It must be possible to reduce an array by a binary arithmetic operation along a given dimension. The array is reduced in dimension by one; 1-dimensional arrays are reduced to primitive scalars. Empty dimensions or arrays may be reduced to the operation’s identity if the operation has an identity — it must return an error otherwise. The identity for addition is 0, multiplication is 1, and we may take the identity for minimization and maximization to be the largest and smallest values available in the primitive data type, respectively. For instance, the minimum of an empty array of floating point numbers may be taken to be infinity.

Any array library supporting these basic features may be extended as specified in this document.

Features provided by an awkward-array library

An awkward-array library provides the above features in the following new contexts.

  • Jaggedness: multidimensional arrays with a non-constant shape. A jagged array is an array of arbitrary-length lists.

  • Product types: extend a primitive type system with record structures — objects with named, typed fields.

  • Sum types: extend a primitive type system with tagged unions — values that may have one of a set of enumerated types. This permits arrays to be heterogeneous in a controlled way.

  • Option types: extend a primitive type system with a nullable type — values that may be “missing.”

  • Cross-references and cyclic references: extend a primitive type system with values that may be references to another array, including a parent of the array in which they reside. This adds “pointers” to the type system in a controlled way: references must point to elements of a specified array.

  • Opaque object types: allow array elements to virtually contain instances of any type expressible in a programming language, provided that it can be constructed strictly from elements of the array.

  • Object methods: adds user-defined methods to arrays, usually to emulate object methods as mapped kernels.

  • Indirection: allows arrays to be defined as a cherry-picked subset of other arrays.

  • Sparseness: allows arrays to be defined at a minority of their index positions.

  • Non-contiguousness: allows arrays to be non-contiguous in memory by mapping indexes. This virtually concatenates data from separate chunks into a single logical array without copying.

  • Laziness: allows arrays to be loaded or generated on demand, allowing arrays that have not yet been materialized to be treated on the same footing with arrays that have.

Jaggedness, product types, sum types, option types, and references extend the expressivity of basic arrays to a complete, hierarchical data model. General data containers like Protocol buffers, Thrift, Avro, and Parquet present this data model, with the exception of references.

Object types and methods generalize it further, allowing any type permitted by a programming language such as Python, with a loss of cross-language compatibility.

Indirection, sparseness, non-contiguousness, and laziness do not affect data type: they are purely low-level features.

Taken together, these features promote array programming to a wider set of applications.

High-level types

To describe how awkward-array extends basic array types, we start by defining a notation that encompasses both. Basic arrays can be fully described by their dtype and shape. These parameters are not sufficient for awkward-array.

Basic arrays

Any array, including those in awkward-array, can be thought of as a function that maps extraction indexes to values. The functional type of a basic, 1-dimensional array with value type T and length n could be written as

[0, n) -> T

That is, the array is a function that takes an integer greater than or equal to 0 and less than n as its only argument, and returns a value of type T. The possible value types are the primitive types of the basic array library. Knowing the array’s type signature as an extraction function is enough to deduce its behavior in slicing, masking, and gathering.

A 2-dimensional array with shape (n, m) is a function that returns a function.

[0, n) -> [0, m) -> T

That is, if we pass an integer [0, n) to the 2-dimensional array, we get a 1-dimensional array; if we pass an integer [0, m) to that 1-dimensional array, we get a primitive value of type T. This currying can be applied indefinitely to describe arrays of any dimensionality. The shape in a tuple syntax like Numpy’s is more concise, but we will need the longer form.

Jagged arrays

Jagged arrays are like multidimensional arrays in the number of integer arguments that must be passed before obtaining a scalar primitive type, but not all of the arguments have precise domains. A simple jagged array of length n, only one level of jaggedness, and primitive type T would be expressed as

[0, n) -> [0, inf) -> T

because the second argument may be any non-negative integer. Unlike a basic array, some values for this second parameter, allowed by the above expression, would be rejected in practice. For example, the array

[[1, 2, 3], [], [4, 5]]

would accept only [0, 3) as a second argument if the first argument is 0, would accept nothing (empty domain) if the first argument is 1, and would accept only [0, 2) if the first argument is 2. We have a choice between expressing the type signature fully, such that any arguments satisfying that signature would never fail to return a primitive value, and underexpressing it with [0, inf), which has the advantage that the length of the type signature does not scale with the length of the array itself. Brevity is more useful for our purposes.

In this document, we refer to a full listing of the sizes of subarrays as a jagged array’s jagged structure. Some operations on two or more jagged arrays can only be performed if they have the same jagged structure.

Continuing this line of reasoning, a doubly jagged array, such as

[[[1, 2, 3], []], [[4, 5]], []]

would have type

[0, n) -> [0, inf) -> [0, inf) -> T

Product types

Product types are variously known as “records,” “structs,” “compounds,” “classes,” etc. They are values that contain a fixed set of named, typed attributes. They are “products” in the sense of Cartesian products: the set of records containing a floating point x and an integer i is the Cartesian product of the set of floating point values and the set of integer values.

To extract an attribute from a record, we give the record the attribute’s name. In a dynamically typed language, this amounts to passing a string argument; in a statically typed language, this string is usually a parsed, checked, compile-time literal. In either case, we can express it as an extraction function like

'x' -> float64
'i' -> int64

The names are enumerated because there is a fixed set of choices, and each choice returns a potentially different type. An array of length n containing these records is

[0, n) -> 'x' -> float64
          'i' -> int64

In general, the form is a sequence of name, type specification pairs called fields that can be embedded in a type specification. Product types with the same set of fields, in any order, are equivalent, though an awkward-array library should maintain the user-specified order for readability.

Jagged arrays, basic arrays, and record structure may be freely intermixed. An awkward-array type with these elements is, in general, a tree:

[0, n) -> [0, inf) -> 'one'   -> bool
                      'two'   -> [0, inf) -> int64
                      'three' -> 'x' -> [0, inf) -> float64
                                 'y' -> complex128

Extraction indexes (integers) and field names (strings) can be commuted (swapped in order). Extraction indexes do not commute with other extraction indexes, as this would violate dimension order, and field names do not commute with other field names, as this would violate records of different nesting depths, but extraction indexes commute with field names. Reversing an integer index and a string name amounts to selecting column before row or row before column in a table.

Ignoring this distinction hides the distinction between an array of structs and a struct of arrays, so that array manipulation code does not depend on this difference (just as Numpy hides C vs. Fortran order as an internal array flag). For instance,

[0, n) -> 'x' -> [0, m) -> T1
          'y' -> [0, m) -> T2

is equivalent to

[0, n) -> [0, m) -> 'x' -> T1
                    'y' -> T2

because all fields in the record have the same array dimension m, this array dimension can be commuted toward the root. The second form is considered canonical: extraction indexes should be commuted toward the root to reduce redundancy.

The same commutation is possible for jagged dimensions (perhaps surprisingly). A jagged array of records is equivalent to a record of jagged arrays, all with the same jagged structure. If we had a detailed type schema that encoded this jagged structure, rather than hiding it under the symbol [0, inf), we could perform the same commutation on jagged arrays as on regular arrays. However, as a limitation of this notation,

[0, n) -> 'x' -> [0, inf) -> T1
          'y' -> [0, inf) -> T2

can’t be commuted to

[0, n) -> [0, inf) -> 'x' -> T1
                      'y' -> T2

because we do not know if fields x and y have the same jagged structure. (A record with a single field can be commuted.)

Sum types

Sum types are known as “tagged unions” in programming languages that support them. They are values that may be any one of an enumerated set of types. They are “sums” in the sense that they are dual to product types: if a record type with two fields of types T1 and T2 is denoted T1 * T2, a union type that can be T1 or T2 is T1 + T2, obeying a distributive law. Unions are useful for building heterogeneous arrays in a controlled way. Class inheritance in object-oriented programming is a limited case of sum typing.

We delimit the enumerated types in a sum type with a vertical bar. Consider, for instance, a value that may be a floating point number or a jagged list of booleans.

float64          |
[0, inf) -> bool

These may be embedded in any type specification.

[0, n) -> (float64          |
           [0, inf) -> bool )

Parentheses are required because the fields of a product type (denoted by adjacency) have higher operator precedence than the enumerations of a sum type. For instance,

[0, n) -> ('x' -> float64
           'y' -> float64
           'z' -> float64   |
           [0, inf) -> bool )

for unions of x, y, z records with jagged arrays of booleans, and

[0, n) -> ('x' -> float64
           'y' -> float64
           'z' -> float64 |
           'x' -> int64
           'y' -> int64   )

for unions of x, y, z floating point records with x, y integer records.

Sum types with the same set of enumerated types, in any order, are equivalent, though an awkward-array library should maintain the user-specified order for readability and an unambiguous type resolution order. (A value may be a member of more than one of the enumerated types.)

Sum types may be nested, though they are equivalent to their flattened form. For instance,

[0, n) -> (bool          |
           int64         |
           (float64    |
            complex128 ) )

is equivalent to

[0, n) -> (bool       |
           int64      |
           float64    |
           complex128 )

Option types

An important special case of sum types is to describe missing data with a null, None, na, or NaN token. This could be expressed as a union of the non-missing data type with a unit type for None, but building these constructions manually (as in Avro) becomes unwieldy when most data can be missing (is “nullable”) and it forces us to consider an array that can only be filled with None as a legitimate array type. Instead, we introduce another element: an option type.

Option types have one parameter, the non-missing type T, and indicate that values may be missing. When extracting data from such an array, the values might have type T or might be None (in Python). We denote this with a question mark and parentheses; for example,

[0, n) -> ?(float64)

or

[0, n) -> [0, inf) -> ?([0, 3) -> int64)

or

[0, n) -> ?('x' -> float64
            'y' -> float64)

Option types may be nested, though they are equivalent to their flattened form. For instance,

[0, n) -> ?(?(float64))

is equivalent to

[0, n) -> ?(float64)

Note that Numpy has a numpy.ma.MaskedArray type, but awkward-array has its own awkward.MaskedArray. The awkward-array masked type can contain non-basic arrays (jagged arrays, tables, etc.) and uses None to represent missing values, rather than numpy.ma.masked, an object with surprising properties.

Option types of sum types introduces another equivalence: a masked union is equivalent to a union in which any one (or more than one) of the union’s enumerated types is masked. For example, [1, "two", None] is a union of integers and strings and is masked. This may be

[0, 3) -> ?(int64         |
            <class 'str'> )

        - or -

[0, 3) -> (?int64        |
           <class 'str'> )

        - or -

[0, 3) -> (int64          |
           ?<class 'str'> )

        - or -

[0, 3) -> (?int64         |
           ?<class 'str'> )

The canonical form is the first, with the option type outside the union type, and no option types on any of the union’s enumerated types.

Cross-references and cyclic references

As described above, an awkward-array type can be a tree of primitive types, basic arrays, jagged arrays, product types, sum types, and option types. An arbitrary tree of types is powerful (see Protocol buffers, Thrift, Avro, and Parquet), but a given type tree sets an upper limit on the depth of values that are members of that type. There would be, for instance, no type specification for JSON, or even the subset of JSON corresponding to “lists that contain numbers or other lists.”

This constraint can be lifted by allowing the type specification to contain references to other parts of itself. We represent these references by assigning a subexpression with := and then referring to that subexpression by name, rather than rewriting the subexpression. If the reference is not contained within the subexpression it references, it is a cross-reference and this notation reduces verbosity.

In the following example of a cross-reference, a dataset contains x, y, z data points and moving windows representing contiguous ranges of these data points. The window has access to data beyond its own array element, but it does not have access to anything beyond the data field.

[0, n) -> 'data'   -> T0 := 'x' -> float64
                            'y' -> float64
                            'z' -> float64
          'window' -> [0, inf) -> T0

If the reference is contained within the subexpression it references, it is a cyclic reference and the assignment notation avoids infinite recursion.

For example, consider lists that contain numbers or other lists, such as the value below.

[[1.1, [2.2, [3.3, 4.4, []]]]]

Lists with this depth and no greater could be expressed as belonging to a sum type of jagged arrays nested four levels deep. Lists like the above at any depth can be expressed as belonging to

[0, n) -> T0 := (float64        |
                 [0, inf) -> T0 )

The subexpression T0 represents floating point numbers or jagged arrays of T0. Trees of any depth belong to this type, as do cyclic graphs. With these elements, we can represent a wide variety of data structures, as general as most programming languages.

Opaque object types

Data constructed from primitives, basic arrays, jagged arrays, product types, sum types, option types, cross-references, and cyclic references are called pure constructions. They are entirely expressible in terms of basic arrays, which are themselves portable across environments. However, if a programming task requires special types, such as instances with particular methods or inheritance from a third-party interface, then pure constructions would have to be wrapped.

Wrapped data are represented in the type system as opaque types, types outside of awkward-array’s system. For instance, we may emulate an array of opaque objects by constructing these objects from an awkward-array upon extraction. This opaque constructor, which may be a helper function rather than a class’s built-in constructor, is included in the type specification to represent such a case.

For example, an array of Python strings would be

[0, n) -> <class 'str'>

with a jagged array of character primitives hidden behind the string constructor.

Opaque objects cannot be shared across platforms the way that pure constructions can. Gaining flexibility in one way diminishes it in another.

Type representation

In awkward-array, the above types are represented by the following classes in awkward.type:

  • Type is an abstract base class for type objects. Types may be Types, Numpy dtypes, or Python callables.

  • ArrayType(n, m, ..., to) constructs a linear sequence like [0, n) -> [0, m) -> ... -> to. At least two arguments are required, and all arguments but the last must be positive integers or infinity (which is floating point) or a string, to make a single-field table. The last argument must be a basic array dtype (for a primitive type) or a Python callable (for an opaque type). ArrayType objects are recursively linked lists: each instance has two members, takes and to, where takes is a single number and to is the rest of the sequence.

  • TableType(**fields) constructs a product type from a mapping of field names to field types. The preferred way to construct a TableType is through the & operator, which glues single-field tables into a multi-field table. The TableType has get-item and set-item methods for manipulation as a Python dict.

  • UnionType(*possibilities) constructs a sum type from a sequence of types. The preferred way to construct a UnionType is through the | operator. The UnionType has get-item and set-item methods, as well as append, for manipulation as a list.

  • OptionType(type) constructs an option type from a type parameter. It has a type attribute.

  • There are no special classes for cross-references and cyclic references. Build an internally referenced type object to represent an internally referenced type. All Type methods except for those that attempt to yield equivalent Numpy types (dtype and shape) should be recursion-safe.

This module has two functions:

  • fromarray(array) returns the type specification of any array, whether an awkward-array or a basic (Numpy) array.

  • fromnumpy(shape, dtype, masked=False) returns the type specification of a basic (Numpy) array directly from shape, dtype, and masked parameters.

General properties of all array classes

Array classes are defined in the awkward.array submodule but are all accessible directly from the top-level awkward module. They all have the following properties.

  • They have primary constructors that define the array in terms of the most general components; for example, starts, stops, and content for JaggedArray and tags, offsets, and contents for UnionArray.

  • They have class-method constructors with more convenient constructors, such as JaggedArray.fromoffsets(offsets, content) and UnionArray.fromtags(tags, contents).

  • Primary constructor arguments are read-write properties of the array object.

  • Single-property validity tests, which verify conditions that are a function of only one property, are performed in the constructor and upon assignment. Errors raise exceptions.

  • Whole-array validity tests, which verify conditions that are relationships among properties in an array, are performed just before array evaluation. Errors raise exceptions.

  • All arrays have get-item methods that perform extraction, slicing, masking, and gathering. If the array contains a product type, a string selects a column and a sequence of strings selects multiple columns.

  • String indexes commute with extraction/slicing/masking/gathering, but they can’t be used in the same set of square brackets, the way extraction/slicing/masking/gathering can.

  • All arrays have a set-item method to add columns to a Table (product type), but only for this purpose. Elements of an awkward-array cannot be changed in place without digging down to the basic array (and that can have hard-to-predict consequences).

  • All arrays have a length (for len) and Pythonic iteration (for for loops).

  • String representations (str and repr in Python) show logical content in square brackets without commas (spaces only), limited to 6 elements at each level of depth (filling in missing with three dots: ...). In Python, the repr representation includes the top-most class name but str does not. String representations trigger whole-array validity tests.

  • All arrays can be pickled, which invokes the full serialization protocol described at the end of this document.

Properties and methods of all arrays

All arrays have the following read-only properties and methods.

  • type: a representation of the array type.

  • dtype and shape: attempt to express the array as a basic (Numpy) array, which can involve information loss. Raises an error if the array cannot be expressed in Numpy’s terms.

  • columns: a list of field names for the shallowest Table within the array that are valid Python identifiers. The allcolumns read-only property is not restricted to valid Python identifiers.

  • tolist(): returns a Pythonic representation of the data (also valid JSON, if all floating point values are finite). Nested arrays become nested Python lists and Table.Rows become dicts of name, field value pairs.

  • valid(): tests whole-array validity without raising an exception; returns True or False.

  • copy(...): copies an array object without recursively copying array content. Any explicitly passed arguments (none are required, but otherwise same arguments as the array’s constructor) are used instead of copying the array’s properties. Passing all arguments is equivalent to calling the class’s primary constructor.

  • deepcopy(...): like the above except that all array content is recursively copied.

  • empty_like(...), zeros_like(...), and ones_like(...): create arrays with the same structure but uninitialized, zeros, or ones for content.

  • astype(dtype): return a copy of the whole structure with basic arrays at the deepest level converted to dtype.

Reducer methods on all arrays

Reducers convert arrays into scalars, reducing the dimensionality by one. In awkward-array, this applies to the innermost dimension (like axis = len(array.shape) - 1 in Numpy) and it removes one level of jaggedness (the innermost) from jagged arrays. They are all mask-aware and treat NaN values in floating point types as though they were masked.

  • any(): returns True if any values are non-masked and non-zero. Arrays without non-masked values yield False.

  • all(): returns True if all non-masked values are non-zero. Arrays without non-masked values yield True.

  • count(): returns the number of non-masked values.

  • count_nonzero(): returns the number of non-masked, non-zero values.

  • sum(): returns the sum of all non-masked values or 0 if there are no non-masked values.

  • prod(): returns the product of all non-masked values or 1 if there are no non-masked values.

  • min(): returns the minimum of non-masked values. Arrays without non-masked values yield inf for floating point types and the maximum integer value for integer types.

  • max(): returns the maximum of non-masked values. Arrays without non-masked values yield -inf for floating point types and the minimum integer value for integer types.

The methods argmin() and argmax() are not reducers: they return an array of zero or one elements: zero if there are no non-masked values, a single index position of the non-masked value that minimizes or maximizes the array otherwise. Data in this form is usable as a gather index. When applied to another array, it returns an empty array or a singleton, depending on the length of the argmin() or argmax().

Global switches and types

The awkward.array.base.AwkwardArray abstract base class has switches to control global behavior. They may be set at any level: on the base class to affect all awkward-array types, all instances, or on a concrete class for all instances of that class, or on a single instance. In languages other than Python, an alternative mechanism may be substituted.

  • allow_tonumpy default: True, if False, any operation that would convert an awkward-array type into a basic (Numpy) array instead raises a RuntimeError.

  • allow_iter default: True, if False, any attempt to iterate over an awkward-array in Python (except via str or repr) would raise a RuntimeError.

  • check_prop_valid default: True, if False, skip single-property validity checks.

  • check_whole_valid default: True, if False, skip whole-array validity checks.

The default primitive types are also defined on awkward.array.base.AwkwardArray.

Awkward type Numpy dtype Purpose

DEFAULTTYPE

float64

default array content

CHARTYPE

uint8

array content for byte-granularity arrays

INDEXTYPE

int64

default type for indexes (signed so that subtraction does not change type)

TAGTYPE

uint8

default type for tagged union tags

MASKTYPE

bool_

type for byte masks

BITMASKTYPE

uint8

type for bit masks

BOOLTYPE

bool_

type for boolean data

Jaggedness

Jagged arrays have a logical structure that is independent of how they are represented in memory, but since awkward-array defines this structure in terms of a basic array library (Numpy), the structure we choose is a visible part of the awkward-array specification. This section presents many ways to represent jagged arrays, their advantages and disadvantages, before specifying the JaggedArray class itself. The JaggedArray class uses the most general representation internally with conversions to and from the other forms.

One natural way to represent a jagged array is to introduce markers in the serialized content where each variable-length nested list begins or ends, or to insert nested list sizes before each nested list (as in the Avro protocol) to avoid having to distinguish content values from markers. However, this “row-wise” representation interrupts vectorized processing of the content. Another natural way is to create an array of pointers to nested lists, like Numpy’s object array, but this is even worse because it additionally increases memory latency.

Columnar representations keep the contents of the nested lists in a single, contiguous array (a “column”). The ROOT file format was probably the first columnar representation of jagged arrays (1995), though the intention was for efficient packing and compression on disk, rather than processing in memory. However, the columnar arrays of a ROOT file may be transplanted into memory for efficient computation as well. The Parquet file format (2013) has a different columnar representation of jagged arrays, though it modifies (“shreds”) the data in a way that is hard to use without fully restructuring it. The Arrow format (2016) uses one of the methods described below to perform efficient calculations on data in memory.

The simplest way to represent a jagged array with columnar arrays is to store flattened content in one array and counts of the number of elements in each interior list in another array. The starting and stopping index of one element — an interior list — can unambiguously be determined by summing counts up to the element of interest. This operation is O(N) in array length N, unfortunately. It is, however, composable, in that nested lists of nested lists (and so on) can be constructed by setting one jagged array as the content of another. For example, to represent the following nested structure:

[[], [[1.1, 2.2, 3.3], [], [4.4, 5.5]], [[6.6, 7.7], [8.8]]]

we note that the first level of depth contains lists of length 0, length 3, and length 2. Inside that (and ignoring boundaries of the first level of depth), the second level of depth contains lists of length 3, 0, 2, 2, and 1. Inside that, the content consists of floating point numbers. (The type for this doubly jagged array is [0, inf) -> [0, inf) -> float64.) It can be represented by three arrays:

  • outer counts: 0, 3, 2

  • inner counts: 3, 0, 2, 2, 1

  • inner content: 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8

The inner jagged array instance has inner counts and inner content as its counts and content, and the outer jagged array instance has outer counts as its counts and the whole inner jagged array as its content. Recursively, we can construct jaggedness of any depth from a single JaggedArray class.

To address the random access problem, we can consider replacing counts with its integral, offsets. An offsets array is a cumulative sum of counts, which avoids the need to recompute the sum for each lookup. Given a counts array, we compute the offsets by allocating an array one larger than counts, filling its first element with 0, and filling each subsequent element i with offsets[i] = offsets[i - 1] + counts[i - 1]. Inversely, counts is the derivative of offsets, and can be derived with a vectorized counts = offsets[1:] - offsets[:-1]. (There is a vectorized algorithm for computing the cumulative sum as well.) The nested list at index i is content[offsets[i]:offsets[i + 1]]. The Arrow in-memory format uses offset arrays to define arbitrary length lists.

Like jagged arrays defined by counts, jagged arrays defined by offsets are composable, but unlike counts, any element may be accessed in O(1) time. There are only a few situations in which counts may be preferable:

  • counts are non-negative small integers, which can be packed more efficiently with variable width encoding and/or lightweight compression (both of which destroy O(1) lookup time anyway);

  • counts are position-independent, allowing a large dataset to be processed in parallel without knowing the absolute positions of each parallel worker’s chunks. This is particularly useful for generating large sequences when the total size of each chunk is not known until fully generated.

One shortcoming that counts and offsets share is that they can only describe dense content. The data for list i + 1 must appear directly after the data for list i. If we wish to view the jagged array with any interior elements removed, we would have to make a new copy of the content with those lists removed, which could trigger a deep recursive copy. It would be more efficient to allow the content to contain unreachable elements, so that these selections can be zero-copy views.

A jagged array based on counts can have unreachable elements: any content at indexes greater than or equal to sum(counts) are not in the logical view of the jagged array. A jagged array based on offsets can have uncreachable elements at indexes less than offsets[0] and greater than or equal to offsets[-1], assuming that we allow offsets[0] to be greater than 0. To allow interior elements to be unreachable, we have to generalize offsets into two arrays, starts and stops. These two arrays (nominally) have the same shape as each other and define the shape of the jagged array. The nested list at index i is content[starts[i]:stops[i]]. Given an offsets array, we can compute starts and stops by starts = offsets[:-1] and stops = offsets[1:].

A jagged array defined by starts and stops can skip any interior content, can repeat elements, can list elements in any order, and can even make nested lists partially overlap. Skipping elements is useful for masking, repeating elements is useful for gathering, and reordering elements is useful for optimizing data to minimize disk page-reads. (No use for partial overlaps is currently known.) A potential cost of separate starts and stops is that it can double memory use and time spent in validation tests. However, if the starts and stops happen to be dense and in order, they can be views of a single offsets array and if this case is detected, simplified calculations may be performed.

These three arrays — starts, stops, and content — overrepresent the logical structure of a jagged array. Two jagged arrays constructed from different starts/stops/content may be compatible for elementwise operations and may even be equal. An easy way to see this is to consider the fact that the starts/stops scheme allows content to be reordered without affecting the data it represents. Another consideration is that unreachable content may differ in values or length. Only an array defined by offsets (and their starts/stops equivalent) in which offsets[0] == 0 and offsets[-1] == len(content) have a one-to-one relationship between the logical elements of the jagged array and their underlying representation in terms of starts, stops, and content.

The starts/stops scheme is a very general way to describe a jagged array from the outside in, for efficient extraction, slicing, masking, and gathering. It is a tree structure with pointers (indexes) from the root toward the leaves. For reduction operations, however, we need pointers from the leaves toward the root: an array with (nominally) the same length as the content, indicating where each nested list begins and ends. (This is similar to database normalization, and the scheme used by Parquet, though the latter is highly transformed and bit-packed.)

The simplest inside-out scheme is to associate an integer with each content element, and distinct values of these integers indicate different nested lists. (This is closest to database normalization: aggregation over nested lists could then be performed by an SQL group-by.) For efficient access, especially if the jagged array is distributed and acted upon in parallel, we can stipulate that identical values must be contiguous, since content belonging to the same nested list must be contiguous in the starts/stops scheme. Such an array is called a uniques array. It underrepresents a jagged array in two ways:

  • it doesn’t specify an ordering of elements (though we can assume the content is in increasing order), and

  • it can’t express any empty lists (though we can assume that there are none).

Because of this underrepresentation, a uniques array can be used to generate a jagged array but can’t be used to represent one that is already defined by starts and stops. We can modify the definition of uniques to more fully specify a jagged array by requiring the unique values associated with every nested list to be the index of the corresponding starts element. This specialized uniques array is called parents.

For example, with a jagged array logically defined as

[[], [1.1, 2.2, 3.3], [], [4.4, 5.5], [6.6, 7.7], [8.8], []]

the starts, stops, and content are

  • starts: 0, 0, 3, 3, 5, 7, 8

  • stops: 0, 3, 3, 5, 7, 8, 8

  • content: 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8

and the parents array is

  • parents: 1, 1, 1, 3, 3, 4, 4, 5

The first three elements of parents (1, 1, 1) associate the first three contents (1.1, 2.2, 3.3) with element 1 of starts and stops. The next two elements of parents (3, 3) associate the next two contents (4.4, 5.5) with element 3 of starts and stops. The fact that parents lacks 0 and 2 indicate that these are empty lists. Only empty lists at the end of the jagged array are unrepresented unless the total length of the jagged array is also given. Out of order elements can easily be expressed because parents does not need to be an increasing array. Unreachable elements can also be expressed by setting these parents elements to a negative value, such as -1. However, repeated elements cannot be expressed, so a parents array cannot represent the result of a gather operation. Likewise, partial overlaps cannot be expressed.

Given a starts array and its corresponding parents, the following invariant holds for all 0 <= i < len(starts):

parents[starts[i]] == i

and the following holds for all 0 <= j < len(content) that are at the beginning of a nested list:

starts[parents[j]] == j

Although parents is a highly expressive inside-out representation, another that is sometimes useful, called index, consists of integers that are zero at the start of each nested list and increase by one for each content element. For instance, the above example has the following index:

  • index: 0, 1, 2, 0, 1, 0, 1, 0

These values are local indexes for elements within the nested lists. For all 0 <= j < len(content), the following invariant holds:

starts[parents[j]] + index[j] == j

It is also useful to wrap the index array as a jagged array with the same jagged structure as the original jagged array, because then it can be used in gather operations.

All of the above discussion has focused on jagged arrays and nested jagged arrays without any regular array dimensions — that is, without dimensions whose sizes are known to be constant. Jagged arrays are more general, so a regular array may be emulated by a jagged array with constant counts, but this clearly less efficient than storing the regular dimension sizes only once. Regular dimensions that appear after (or “inside”) a jagged dimension can be represented by simply including a multidimensional array as content in a jagged array. That is, to get an array of type

[0, inf) -> [0, m) -> T

construct a jagged array whose content is an array of type [0, m) -> T. Regular dimensions that appear before (or “outside”) a jagged dimension are harder: the starts and stops of the jagged array must both have the shape of these regular dimensions. That is, to get an array of type

[0, n) -> [0, inf) -> T

the starts and stops must be arrays of type [0, n) -> INDEXTYPE. In a counts representation, the counts must be an array of this type. This cannot be expressed in an offsets representation because offsets elements do not have a one-to-one relationship with logical jagged array elements (another argument for starts and stops over offsets).

Some applications of awkward-array may require data that is being filled while it is being accessed. This is possible if whole-array validity constraints on array shapes are not too strict. Assuming that basic arrays can be appended atomically, or at least their lengths can be increased atomically to reveal content filled before increasing their lengths, jagged arrays can atomically grow by

  1. appending content first,

  2. then appending stops,

  3. then appending starts.

The length of the content is allowed to be greater than or equal to the maximum stop value, and the length of stops is allowed to be greater than or equal to the length of starts. The logical length of the jagged array is taken to be the length of starts. As described above, starts and stops must have the same shape, but only for dimensions other than the first dimension.

Likewise, the length of the content may be greater than or equal to the length of the parents array. The parents array must have the same shape as the content in all dimensions other than the first.

JaggedArray

A JaggedArray is defined by three arrays, starts, stops, and content, which are the arguments of its constructor. Below are their single-property validity conditions. They may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • starts: basic array of integer dtype (default is INDEXTYPE) with at least one dimension and all non-negative values.

  • stops: basic array of integer dtype (default is INDEXTYPE) with at least one dimension and all non-negative values.

  • content: any array (default is a basic array of DEFAULTTYPE).

The whole-array validity conditions are:

  • starts must have the same (or shorter) length than stops.

  • starts and stops must have the same dimensionality (shape[1:]).

  • stops must be greater than or equal to starts.

  • The maximum of starts for non-empty elements must be less than the length of content.

  • The maximum of stops for non-empty elements must be less than or equal to the length of content.

The starts, stops, and content properties are read-write; setting them invokes the same single-property validity check as the constructor. In addition, a JaggedArray has the following read-write properties:

  • offsets: basic array of integer dtype (default is INDEXTYPE) with exactly one dimension, at least one element, and all non-negative values. Getting it would raise an error if the starts and stops are not compatible with a dense sequence of offsets. Setting it overwrites starts and stops.

  • counts: basic array of integer dtype (default is INDEXTYPE) with at least one dimension and all non-negative values. Setting it overwrites starts and stops.

  • parents: basic array of integer dtype (default is INDEXTYPE) with at least one dimension. Setting it overwrites starts and stops.

JaggedArray has the following read-only properties and methods:

  • index: index array with jagged structure.

  • regular(): returns a basic N-dimensional array if this jagged array happens to have regular structure; raises an error if not.

  • flatten(): returns the content without nested list boundaries. Equivalent to content in a special case: when the jagged structure is describable by an offsets array and offsets[0] == 0 and offsets[-1] == len(content). Use this method instead of content to ensure generality.

Get-item behavior

When a jagged array myarray is passed a selection in square brackets, it obeys the following rules.

If selection is an integer, the element at that index is extracted (handling negative indexes, if applicable). If the provided index is beyond the array’s range, an error is raised. For example,

myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
myarray[0]
# returns array([1.1, 2.2, 3.3])
myarray[1]
# returns array([], dtype=float64)
myarray[-1]
# returns array([4.4, 5.5])

If selection is a slice, elements selected by the slice are returned as a new jagged array (handling negative indexes, if applicable). For example,

myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
myarray[1:]
# returns <JaggedArray [[] [4.4 5.5]] at 7f02018afc18>
myarray[100:]
# returns <JaggedArray [] at 7f020c214438>

If selection is a non-jagged list or array of booleans, elements corresponding to True values in the mask are returned as a new jagged array. The mask must be 1-dimensional and the mask and jagged array must have the same length, or an error is raised. For example,

myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
mask = numpy.array([True, True, False])
myarray[mask]
# returns <JaggedArray [[1.1 2.2 3.3] []] at 7f020e8122b0>

If selection is a jagged array of booleans, sub-elements corresponding to True values in the jagged mask are returned as a new jagged array. If the jagged mask and the jagged array do not have the same jagged structure, an error is raised. For example,

myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
mask = awkward.JaggedArray.fromiter([[False, True, True], [], [True, False]])
myarray[mask]
# returns <JaggedArray [[2.2 3.3] [] [4.4]] at 7f02018af8d0>

If selection is a non-jagged list or array of integers, elements identified by the integer indexes are gathered as a new jagged array (handling negative indexes, if applicable). For example,

myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
myarray[[2, 0, 1, -1]]
# returns <JaggedArray [[4.4 5.5] [1.1 2.2 3.3] [] [4.4 5.5]] at 7f020c214438>

If selection is a jagged array of integers, sub-elements identified by the integer local indexes are gathered as a new jagged array (handling negative indexes, if applicable). If the length of the indexes is not equal to the length of the jagged array, an error is raised. For example,

myarray = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
indexes = awkward.JaggedArray.fromiter([[2, 2, 0], [], [1]])
myarray[indexes]
# returns <JaggedArray [[3.3 3.3 1.1] [] [5.5]] at 7f02018afa58>

If selection is a tuple, a multidimensional extract/slice/mask/gather operation (in any combination) is performed. Any errors encountered along the way are raised. For example,

myarray = awkward.JaggedArray.fromcounts([2, 0, 1], awkward.JaggedArray.fromiter(
              [[1.1, 2.2, 3.3], [], [4.4, 5.5]]))
myarray
# returns <JaggedArray [[[1.1 2.2 3.3] []] [] [[4.4 5.5]]] at 7f02018afba8>
myarray[2, 0, 1]
# returns 5.5
myarray[myarray.counts > 0, 0, -2:]
# returns <JaggedArray [[2.2 3.3] [4.4 5.5]] at 7f020c214438>

If selection is a string or a list or array of strings, the jagged column of the nested table or jagged subtable, respectively, for that column or those columns is returned. If there are no Table instances nested within content, this raises an error. For example,

myarray = awkward.JaggedArray.fromcounts([3, 0, 2], awkward.Table(
              x=[1, 2, 3, 4, 5],
              y=[1.1, 2.2, 3.3, 4.4, 5.5],
              z=[True, False, True, False, False]))
myarray["x"]
# returns <JaggedArray [[1 2 3] [] [4 5]] at 7f020e8122b0>
myarray[["x", "y"]]
# returns <JaggedArray [[<Row 0> <Row 1> <Row 2>] [] [<Row 3> <Row 4>]] at 7f02018af860>
myarray[["x", "y"]].columns
# returns ['x', 'y']

A string or a list or array of strings is also the only acceptable argument to set-item. Columns may be added to a jagged table, provided that the jagged structure of the new columns matches that of the table.

Mapped kernel behavior

If jagged arrays are passed into a Numpy ufunc (or equivalent mapped kernel), they are computed elementwise at the deepest level of jaggedness, adjusting for different starts/stops/content representations of the same logical structure, and broadcasting scalars and non-jagged values to the jagged structure. If not all jagged arrays have the same logical jagged structure or non-jagged arrays are not broadcastable to this structure (because they have different lengths), an error is raised.

For example,

a = awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [4.4, 5.5]])
b = awkward.JaggedArray([0, 3, 4], [3, 3, 6], [10, 20, 30, -9999, 40, 50])
c = numpy.array([100, 200, 300])
d = 1000

defines a as [[1.1, 2.2, 3.3], [], [4.4, 5.5]] and b as [[10, 20, 30], [], [40, 50]] (-9999 is unreachable). These have the same logical strucutre, but a different physical structure.

a.starts, a.stops
# returns (array([0, 3, 3]), array([3, 3, 5]))
b.starts, b.stops
# returns (array([0, 3, 4]), array([3, 3, 6]))

Nevertheless, they can be combined in the same ufunc because they have the same logical structure, matching sub-element to sub-element before computing. Basic array c is (conceptually) promoted to a jagged array before operating as an instance of jagged broadcasting, and d is promoted as usual for scalar broadcasting.

numpy.add(a, b)
# returns <JaggedArray [[11.1 22.2 33.3] [] [44.4 55.5]] at 7f02018afc50>
numpy.add(a, c)
# returns <JaggedArray [[101.1 102.2 103.3] [] [304.4 305.5]] at 7f02018afba8>
numpy.add(a, d)
# returns <JaggedArray [[1001.1 1002.2 1003.3] [] [1004.4 1005.5]] at 7f02018afd30>

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + b, a + c, and a + d.

Methods

JaggedArray reducers differ from generic reducers in that they only reduce the innermost level of jaggedness: inner nested lists are replaced with scalars, but the total structure is still an array. Hence, a reduced singly-jagged array is a non-jagged array, and a reduced doulby-jagged array is a singly-jagged array. The reduced array has the same length as the unreduced jagged array.

  • any(): returns an array of BOOLTYPE; each is True if the corresponding nested list has any non-masked, non-zero values and False if not or if the nested list has no non-masked values at all.

  • all(): returns an array of BOOLTYPE; each is True if the corresponding nested list’s only non-masked values are non-zero, including the case in which the nested list has no non-masked values at all; False otherwise.

  • count(): returns an array of INDEXTYPE, the number of non-masked values in each nested list.

  • count_nonzero(): returns an array of INDEXTYPE, the number of non-masked, non-zero values in each nested list.

  • sum(): returns an array with the same dtype as the content (if content has a well-defined dtype), the sum of non-masked values in each nested list. Lists with no non-masked values yield 0.

  • prod(): returns an array with the same dtype as the content (if content has a well-defined dtype), the product of non-masked values in each nested list. Lists with no non-masked values yield 1.

  • min(): returns an array with the same dtype as the content (if content has a well-defined dtype), the minimum of non-masked values in each nested list. Lists with no non-masked values yield inf for floating point types and the maximum integer value for integer types.

  • max(): returns an array with the same dtype as the content (if content has a well-defined dtype), the maximum of non-masked values in each nested list. Lists with no non-masked values yield -inf for floating point types and the minimum integer value for integer types.

The jagged argmin() and argmax() methods are not reducers: they return jagged arrays of the local index that minimizes or maximizes the non-masked values in each nested list. If a nested list has no non-masked values, the corresponding nested list in the output is empty. If an output nested list is not empty, it has exactly one value. Data in this form is usable in gather operations.

JaggedArray has the following structure manipulation methods:

  • cross(other): creates a jagged table with columns "0", "1", "2", etc. that is the cross-join of nested list in self and other. self and other must have the same length, and the resulting jagged table has the same length. This meethod can be chained: a.cross(b).cross(c).

  • argcross(other): like cross(other), except that the values in the table are not elements of content but their local indexes, usable in gather operations. Unlike cross(other), chains of argcross(other) produce nested tables with only "0" and "1" columns.

  • pairs() and argpairs(): like cross(self) and argcross(self) except that if the pair corresponding to local indexes i and j are included, the pair corresponding to local indexes j and i are not.

  • distincts() and argdistincts(): like pairs() and argpairs() except that pairs corresponding to local indexes i and i are not included.

  • JaggedArray.concatenate(arrays) and instance.concatenate(arrays): concatenates the jagged arrays, including instance if called as an instance method. The arrays is must be a list of jagged arrays, like numpy.concatenate.

  • JaggedArray.zip(columns) and instance.zip(columns): builds a jagged table from a set of columns (same constructor specification as the Table class, defined below). Includes instance if called as an instance method.

A JaggedArray may be created from one of the following alternate constructors.

JaggedArray.fromiter(iterable)

  • iterable: a list of lists of a primitive type, corresponding to a jagged array of some fixed depth: [0, n) -> [0, inf) -> T, [0, n) -> [0, inf) -> [0, inf) -> T, etc.

JaggedArray.fromoffsets(offsets, content)

  • offsets: basic array of integer dtype (default is INDEXTYPE) with exactly one dimension, at least one element, and all non-negative values.

  • content: any array (default is a basic array of DEFAULTTYPE).

JaggedArray.fromcounts(counts, content)

  • offsets: basic array of integer dtype (default is INDEXTYPE) with at least one dimension and all non-negative values.

  • content: any array (default is a basic array of DEFAULTTYPE).

JaggedArray.fromuniques(uniques, content)

  • uniques: basic array of integer dtype (default is INDEXTYPE) with exactly one dimension and the same length as content.

  • content: any array (default is a basic array of DEFAULTTYPE).

JaggedArray.fromparents(parents, content, length=None)

  • parents: basic array of integer dtype (default is INDEXTYPE) with exactly one dimension and the same length as content.

  • content: any array (default is a basic array of DEFAULTTYPE).

  • length: if not None, a non-negative integer setting the length of the resulting jagged array; useful for adding empty lists at the end or truncating.

JaggedArray.fromindex(index, content, validate=True)

  • index: basic array or jagged array of integer dtype (default is INDEXTYPE). If a jagged array, only a flattened version of the jagged array is considered. The basic or flattened index must have exactly one dimension and the same length as content.

  • content: any array (default is a basic array of DEFAULTTYPE).

  • validate: if True, raise an error if non-zero values are not exactly one greater than the previous and raise an error if index is jagged and the jagged structure of index differs from the jagged structure derived from its values.

JaggedArray.fromjagged(jagged)

  • jagged: jagged array to convert to the given class (without copying data, if possible).

JaggedArray.fromregular(regular)

  • regular: basic array (default has DEFAULTTYPE) with more than one dimension. The array’s regular shape is replaced with the corresponding jagged structure.

JaggedArray.fromfolding(content, size)

  • content: any array (default is a basic array of DEFAULTTYPE).

  • size: number of elements to fold into each nested list of the resulting jagged array, and the maximum number of elements for the last nested list if len(content) % size != 0.

Helper functions

The awkward.array.jagged submodule may define helper functions, such as the following.

  • offsetsaliased(starts, stops): returns True if the starts and stops arrays overlap in memory and are consistent with a single offsets array at starts.base (or equivalently, stops.base); False otherwise.

  • counts2offsets(counts): convert a counts array to an offsets array.

  • offsets2parents(offsets): convert an offsets array to a parents array.

  • startsstops2parents(starts, stops): convert a general starts/stops pair to a parents array.

  • parents2startsstops(parents, length=None): convert a parents array to a starts/stops pair, optionally with a given length. This length may cause empty nested lists to be added at the end of the starts and stops representing a jagged structure or it may truncate the jagged structure, depending on whether it is greater or less than parents.max().

  • uniques2offsetsparents(uniques): convert a uniques array to a 2-tuple of offsets and parents.

  • aligned(*jaggedarrays): return True if all jaggedarrays have the same jagged structure; False otherwise.

Product types

Product types, or arrays of records with a fixed set of named, typed fields can be conceptually represented as tables. The “row-wise” vs. columnar representations discussed in the Jaggedness section were first developed in the context of tables. The “row” and “table” terminology came from a discussion of tables: named, typed attributes are conventionally associated with columns of a data table, while anonymous data points fill the rows. A row-wise data representation can be replaced with a columnar representation by simply transposing it in memory, or at least writing each column of data to a separate, equal-length array. Columnar layouts have been used in tabular databases since TAXIR in 1969.

Numpy has a product type called a structured array or record array. This is a row-wise data representation, which would be hard to mix with columnar jagged arrays. Instead of using structured arrays from the base library directly, awkward-array defines a Table type with the same syntax.

Like Numpy’s structured arrays, Table columns are selected by strings in a get-item, these string get-items commute with extract/slice/mask/gather get-items, and they can’t be used in the same multidimensional tuple with extract/slice/mask/gather get-items. (Despite the tabular metaphors, columns are not a dimension in the sense of N-dimensional arrays; they’re a qualitatively different kind of accessor.) Unlike Numpy’s structured arrays, Table columns have no constraints on where they reside in memory: they may be strides across a Numpy structured array, they may be fully columnar arrays in an Arrow buffer, or they may be Numpy arrays, scattered in memory.

The Table interface hides the distinction between an array of structs and a struct of arrays, an important transformation for preparing data for vectorization. It is used to create objects whose attributes may be widely dispersed in memory, or (through a VirtualArray) not all loaded into memory. (To avoid materializing a VirtualArray, the string representation of Table.Row does not show internal data.)

Regularly divided tables, such as

[0, n) -> [0, m) -> "one"   -> bool
                    "two"   -> int64
                    "three" -> float64

can be expressed by giving all columns the same dimensionality (shape[1:]). This is because the above is equivalent to

[0, n) -> "one"   -> [0, m) -> bool
          "two"   -> [0, m) -> int64
          "three" -> [0, m) -> float64

which is a Table whose column arrays all have shape (n, m).

Table

A Table is defined by an arbitrary number of named arrays, which are columns of the table. A Table need not represent purely tabular data; if it is nested within a JaggedArray, it is a jagged table, and if it contains any JaggedArray, it is a stringy table. Columns may be generated from any basic array, awkward-array, or Python iterable, with DEFAULTTYPE as the default type of empty iterables.

The Table constructor permits the following argument patterns:

  1. Table(column1, column2, ...): initialize with unnamed column arrays. Column names are strings of integers starting with zero ("0", "1", "2", etc.).

  2. Table({"column1": column1, "column2": column2, ...}): initialize with a single dict (may be an ordered dict). Column names are keys of the dict.

  3. Table(column1=column1, column2=column2): initialize with keywords. Column names are the keywords.

Pattern 1 and pattern 2 are incompatible; the first argument is either a subclass of dict or not. More than one positional argument in pattern 2 is not allowed. Both of the first two patterns are compatible with pattern 3: they may be freely mixed, as long as column names are never repeated (impossible with pattern 1).

After construction, columns can be added, overwritten, and removed using Table’s set-item and del-item methods. The fact that Tables may be nested is the only reason awkward-arrays have set-item and del-item methods: to pass a new column to a nested Table or request that one of its columns be deleted. Columns maintain their order (following Python’s ordered dict semantics).

Table has no whole-array validity conditions. The columns might have different lengths, but the total length of the Table is given by the minimum length of all contained columns (zero if there are no columns).

A Table applies slices, masks, and gather indexes lazily: rather than immediately applying these selections, they are stored as an internal view and applied when a single column is selected. Thus, if any columns are VirtualArrays, they won’t be materialized unless that particular column is requested. Internal views must therefore be composed.

Table has the following read-write properties:

  • rowname: defaults to "Row", but may be any string. Can also be set by the Table.named alternate constructor. <<`Table.named(rowname, ...)`,See below>> for an explanation.

  • contents: the columns as an ordered dict. (This is an assignable view, not a copy.)

Table has the following read-only properties and methods:

  • base: if this Table is a view, base is the original table. If not, base is None.

Get-item behavior

When a table myarray is passed a selection in square brackets, it obeys the following rules.

If selection is a string, one column is pulled from the table. If the column lengths do not match, its length is truncated to the table length — the minimum of all column lengths. For example,

myarray = awkward.Table(x=[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
                        y=[100, 101, 102, 103, 104, 105, 106],
                        n=[0, 1, 2, 3, 4])
myarray
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 72afb63cba90>
myarray["x"]
# returns array([0. , 1.1, 2.2, 3.3, 4.4])
myarray["y"]
# returns array([100, 101, 102, 103, 104])
myarray["n"]
# returns array([0, 1, 2, 3, 4])
myarray[["x", "y"]]
# returns <Table [<Row 0> <Row 1> <Row 2> ... <Row 4> <Row 5> <Row 6>] at 7005965b6400>
myarray[["x", "y"]].columns
# returns ['x', 'y']
myarray[["x", "y"]].tolist()
# returns [{'x': 0.0, 'y': 100}, {'x': 1.1, 'y': 101}, {'x': 2.2, 'y': 102},
           {'x': 3.3, 'y': 103}, {'x': 4.4, 'y': 104}, {'x': 5.5, 'y': 105},
           {'x': 6.6, 'y': 106}]

If selection is any integer, slice, list or array of booleans, or list or array of integers, the extraction/slicing/masking/gathering operation is applied to the rows, as though it were any other array. For example,

myarray = awkward.Table(x=[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
                        n=[0, 1, 2, 3, 4])
myarray
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 70e1687f9a58>
myarray[3]
# returns <Row 3>
>>> myarray[3:]
# returns <Table [<Row 3> <Row 4>] at 7e55fe51a278>

The subset of rows have persistent numbers (e.g. “Row 3” in the sliced output is the same object as “Row 3” in the base) because Table views remember their internal viewing state.

Column-projection and extraction/slicing/masking/gathering is order-independent: get-item operations applied in either order return the same output (they commute). For example,

myarray["x"][-3:]
# returns array([2.2, 3.3, 4.4])
myarray[-3:]["x"]
# returns array([2.2, 3.3, 4.4])

This is because a single row of a table is represented by a Table.Row, which has a get-item method for its place in a Table. If a Table.Row is iterated over, its length and iteration correspond to the fields named as consecutive integer strings, starting from zero: "0", "1", "2", etc.

Column-projection and extraction/slicing/masking/gathering cannot be performed in the same tuple, and column-projection of nested tables cannot be performed in the same tuple. Nor do column-projections of nested tables commute. Attempting to do so would raise an erorr. For example,

points = awkward.Table(x=[0.0, 1.1, 2.2, 3.3], y=[0, 100, 101, 102, 103])
myarray = awkward.Table(points=points, n=[0, 1, 2, 3])'
myarray["points"]["x"]
# returns array([0. , 1.1, 2.2, 3.3])
myarray["points"]["y"]
# returns array([  0, 100, 101, 102])
myarray["n"]
# returnsarray([0, 1, 2, 3])

Tables inside of other awkward-array components may not be strictly rectangular. For example, a JaggedArray of Table is a jagged table:

myarray = awkward.JaggedArray.fromcounts([3, 0, 2], awkward.Table(
              x=[0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8],
              n=[0, 1, 2, 3, 4]))
myarray
# returns <JaggedArray [[<Row 0> <Row 1> <Row 2>] [] [<Row 3> <Row 4>]] at 7e33f10569e8>
myarray["x"]
# returns <JaggedArray [[0.  1.1 2.2] [] [3.3 4.4]] at 7e33e188c438>
myarray["n"]
# returns <JaggedArray [[0 1 2] [] [3 4]] at 7e33e188c470>

Other awkward-array components inside of tables may not be strictly rectangular. For example, a Table containing a JaggedArray is a stringy table:

myarray = awkward.Table(
              x=awkward.JaggedArray.fromcounts(
                  [4, 0, 2, 2, 1],
                  [0.0, 1.1, 2.2, 3.3, 4.4, 5.5, 6.6, 7.7, 8.8]),
              n=[0, 1, 2, 3, 4])
myarray
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 73ab6e406a20>
myarray["x"]
# returns <JaggedArray [[0.  1.1 2.2 3.3] [] [4.4 5.5] [6.6 7.7] [8.8]] at 73ab6a1a3e48>
myarray["n"]
# returns array([0, 1, 2, 3, 4])

TODO: multidimensional indexes through a Table.

Mapped kernel behavior

If tables are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is applied separately to each column. If multiple tables are passed into the same ufunc with different sets of columns, an error is raised, and if they have different lengths, an error is raised. For example,

a = awkward.Table(x=[0.0, 1.1, 2.2, 3.3, 4.4], n=[0, 1, 2, 3, 4])
b = awkward.Table(x=[0, 100, 200, 300, 400], n=[0, 100, 200, 300, 400])'
numpy.add(a, b)
# returns <Table [<Row 0> <Row 1> <Row 2> <Row 3> <Row 4>] at 74ce37c32320>
numpy.add(a, b).tolist()
# returns [{'x': 0.0, 'n': 0}, {'x': 101.1, 'n': 101}, {'x': 202.2, 'n': 202},
           {'x': 303.3, 'n': 303}, {'x': 404.4, 'n': 404}]

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + b.

Methods

A Table may be created from one of the following alternate constructors.

Table.named(rowname, ...)

  • rowname: a string to label Table.Row objects.

The row name is used for display purposes (so that “rows” have a more meaningful name in a science domain) and may be used by methods to distinguish types that are structurally identical. For instance, “positions” and “directions” in a 3-dimensional space may both contain columns named "x", "y", and "z", but they should be transformed differently when a coordinate system is rotated.

The existence of a label allows what would usually be a structural type system (tables are identified by the fields they contain) to be treated as a nominative type system (tables are identified by their type name).

Table.fromrec(recarray)

  • recarray: Numpy recarray

Table.frompairs(pairs)

  • pairs: list of 2-tuples of name (string) and array

Table.fromview(view, base)

  • view: None or 3-tuple of start, step, length (integers) or base array of gather indexes

  • base: another Table

Constructs a view into an existing Table, using a representation of views. None means no view (the new Table is identical to the base). The 3-tuple represents a slice in a basis that is independent of table length and is easier to compose: start is the starting element, same as a slice but strictly non-negative, step is a step size, same as a slice (cannot be zero), and length is the number of steps to take, rather than truncating by a stop. Gather indexes are the same as indexes that would be passed to get-item. A boolean mask can be converted into gather indexes with numpy.nonzero.

Sum types

Sum types, or tagged unions, allow us to build heterogeneous arrays. As a data type, tagged unions are needed to express a collection that mixes data of incompatible types, but our use of tagged unions is broader: we may want to mix data that reside in different columnar arrays, regardless of whether they’re different types. This allows us to express the result of a blend (in the SIMD sense) without copying data. For example, SparseArray needs to blend data from a sparse lookup table with zeros from a different source when it is sliced; it uses a UnionArray to represent that result.

The general structure of a UnionArray is a collection of arrays with a tags array to specify which is active in each element. If tags[i] is 3, then the array value at i is drawn from array 3. In Arrow terminology, the tags array is the “types buffer.”

If we always draw element i from the array at tags[i], then all other arrays would have to be padded with unreachable elements at i, what Arrow calls a “sparse union.” Instead, we add another array, an index to identify the elements to draw from the selected arrays; we use what Arrow calls a “dense union.” (Arrow calls this index the “offsets,” but it is more similar to the index of our IndexedArray than the offsets of our JaggedArray.)

Given a set of arrays contents, a tags array tags, and an index array index, the element at i is:

contents[tags[i]][index[i]]

It is possible to emulate an Arrow sparse union by setting the index to a simple numeric range (numpy.arange(len(tags))). It is possible to generate an index for a union whose contents are in order and have no padding:

index = numpy.full(tags.shape, -1)
for tag, content in enumerate(contents):
    mask = (tags == tag)
    index[mask] = numpy.arange(numpy.count_nonzero(mask))

In circumstances where the index can be derived, it does not need to be stored.

Regularly divided unions, such as

[0, n) -> [0, m) -> (int64 |
                     complex128)

can be expressed by giving the tags and index arrays a multidimensional shape. The length of the tags must be less than or equal to the length of the index, but all dimension sizes after the first must be identical.

UnionArray

A UnionArray is defined by two arrays and an ordered sequence of arrays. Below are their single-property validity conditions. Arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • tags: basic array of integer dtype (default is TAGTYPE) with at least one dimension and all non-negative values.

  • index: basic array of integer dtype (default is INDEXTYPE) with at least one dimension and all non-negative values.

  • contents (note plural): non-empty Python iterable of any arrays (default are basic arrays of DEFAULTTYPE).

The whole-array validity conditions are:

  • tags length must be less than or equal to index length.

  • tags and index must have the same dimensionality (shape[1:]).

  • The maximum of tags must be less than the number of arrays in contents.

  • The maximum of index must be less than the minimum length of contents arrays.

The tags, index and contents properties are read-write; setting them invokes the same single-property validity check as the constructor. In addition, a UnionArray has the following read-only properties:

  • issequential: is True if all contents are in order with no padding; in which case, the index is redundant and could be generated by UnionArray.fromtags.

Get-item behavior

When a union array myarray is passed a selection in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections are passed down to a nested Table, if it exists.

For example,

myarray = awkward.UnionArray.fromtags([0, 1, 1, 0, 0, 1], [
              numpy.array([1.1, 2.2, 3.3]),
              awkward.JaggedArray.fromiter([[100, 200, 300], [], [400, 500]])])
myarray
# returns <UnionArray [1.1 [100 200 300] [] 2.2 3.3 [400 500]] at 7f5e1aceb7b8>
myarray[1:5]
# returns <UnionArray [[100 200 300] [] 2.2 3.3] at 7f5e1acf0f98>
myarray[1, 2]
# returns 300

Some of these selections may not be valid for all contents. Whether their application raises an error depends on which contents are touched by the selection. That is, a user can avoid an indexing error by applying an appropriate mask to avoid selecting rows or columns from nested content where those rows or columns do not exist. For example,

myarray = awkward.UnionArray.fromtags([0, 1, 0, 0, 1], [
              numpy.array([1.1, 2.2, 3.3]),
              awkward.JaggedArray.fromiter([[100, 200, 300], [400, 500]])])
myarray
# returns <UnionArray [1.1 [100 200 300] 2.2 3.3 [400 500]] at 7f5e1aceb630>
myarray[myarray.tags == 1, :2]
# returns <JaggedArray [[100 200] [400 500]] at 7f5e1aceb7b8>

A second dimensional index would be wrong for contents[0], a basic 1-dimensional array of floating point numbers. By masking with myarray.tags == 1, we ensure that this index is not applied where it shouldn’t be.

Mapped kernel behavior

If union arrays are passed into a Numpy ufunc (or equivalent mapped kernel), they are computed separately for each of the contents (if possible) and those results are combined into a new union array as output. They do not need to have the same set of tags, but they need to have the same lengths.

For example,

a = awkward.UnionArray.fromtags([0, 1, 1, 0, 0, 1], [
        numpy.array([1.1, 2.2, 3.3]),
        awkward.JaggedArray.fromiter([[100, 200, 300], [], [400, 500]])])
a
# returns <UnionArray [1.1 [100 200 300] [] 2.2 3.3 [400 500]] at 7f5e1aceb710>
numpy.add(a, 10)
# returns <UnionArray [11.1 [110 210 310] [] 12.2 13.3 [410 510]] at 7f5e1aceb6d8>

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + 10.

Methods

A UnionArray may be created from one of the following alternate constructors.

UnionArray.fromtags(tags, contents)

  • tags: same as primary constructor.

  • contents: same as primary constructor.

This methods generates an index assuming that all contents are in order with no padding. Union arrays generated this way would always have issequential == True.

Option types

In type theory, option types may be considered a special case of sum types: ?T is the sum of T with a unit type; a unit type has only one possible value, null. As described above, we do not wish to introduce an array type whose only information content is the shape of the array.

Additionally, we implement option types in a different way from unions: as boolean masks. With the exception of IndexedMaskedArray, Each missing value in a masked array has only one bit of information, the fact that it is missing. A single boolean mask array suffices. An awkward-array library has three masked array types:

  • MaskedArray (superclass): the mask array has one boolean per byte.

  • BitMaskedArray: the mask array has one boolean per bit, with padding to fill a whole number of bytes.

  • IndexedMaskedArray: the mask array functions both as a mask, with a negative value like -1 indicating that an element is missing, and as an index, so that the content does not need to have unreachable elements. This can be important if content values are large, such as a wide Table.

Numpy has a numpy.ma.MaskedArray type that uses one boolean per byte to indicate missing values. Arrow defines all types as potentially masked with one boolean per bit to indicate missing values. Neither have an equivalent for IndexedMaskedArray.

With MaskedArray and BitMaskedArray, there is a two-fold ambiguity: should True mean that a value is missing or that a value is present? Both classes have a maskedwhen argument indicating which boolean value is a masked value (default is True, values of True in the mask array mean data are missing). Numpy’s numpy.ma.MaskedArray has maskedwhen = True, and Arrow’s bitmasks have maskedwhen = False.

With BitMaskedArray, there is another two-fold ambiguity: should bits read from most significant to least significant or least significant to most significant in each byte? This is a bit-level equivalent of the endianness ambiguity, but it is not decided by hardware because most CPU instruction sets don’t operate on individual bits. BitMaskedArray has an lsborder that is True for Least Significant Bit (LSB) ordering and False for Most Significant Bit (MSB) ordering. Arrow’s bitmasks have lsborder = True.

IndexedMaskedArray has an integer-typed mask array, so it has no maskedwhen. Any negative value corresponds to being masked.

Regularly divided optional types, such as

[0, n) -> [0, m) -> ?T

can be expressed by giving the mask arrays a multidimensional shape. This is not possible for BitMaskedArray, since bits cannot be shaped, nor can an exact length be prescribed, since bits must pack into bytes and therefore pad up to seven values. Therefore, BitMaskedArray additionally has a maskshape to define the sizes of all dimensions, including the first (length).

The value returned for missing data is MaskedArray.mask, which is by default None. BitMaskedArray and IndexedMaskedArray inherit from MaskedArray, so setting MaskedArray.mask changes the return value for missing data globally.

MaskedArray

A MaskedArray is defined by two arrays and a boolean maskedwhen. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • mask: basic array of boolean dtype (default is MASKTYPE) with at least one dimension.

  • content: any array (default is a basic array of DEFAULTTYPE).

  • maskedwhen: boolean; element i is considered missing if mask[i] == maskedwhen (default is True).

The whole-array validity conditions are:

  • flattened mask length must be less than or equal to the content length.

The length of the MaskedArray is determined by the length of the mask array.

Masked arrays (all types) have the following read-only properties:

  • masked: boolean per byte array with the length of the array; True where values are masked, False where they are not (independent of maskedwhen).

  • unmasked: negation of masked.

Get-item behavior

When a masked array (any type) myarray is passed a selection in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections are passed down to a nested Table, if it exists.

For example,

myarray = awkward.MaskedArray([False, True, True, False],
              awkward.JaggedArray.fromiter([[1.1, 2.2, 3.3], [], [999], [4.4, 5.5]]))
myarray
# returns <MaskedArray [[1.1 2.2 3.3] None None [4.4 5.5]] at 7f5e1aceb7b8>
myarray[0]
# returns array([1.1, 2.2, 3.3])
myarray[1]
# returns None
myarray[myarray.isunmasked, 1:]
# returns <MaskedArray [[2.2 3.3] [5.5]] at 7f5e1acf0f60>

Mapped kernel behavior

If masked arrays (any type) are passed into a Numpy ufunc (or equivalent mapped kernel), values that are not masked in all inputs (including any non-masked arrays) are converted into IndexedMaskedArrays without padding before applying the ufunc. Unnecessary values do not enter the calculation.

For example,

a = awkward.MaskedArray([False, False, True, False, True], [1.1, 2.2, 3.3, 4.4, 5.5])
b = awkward.MaskedArray([False, True, True, False, False], [100, 200, 300, 400, 500])
a
# returns <MaskedArray [1.1 2.2 None 4.4 None] at 7f5e1aceb6d8>
b
# returns <MaskedArray [100 None None 400 500] at 7f5e1aceb710>
numpy.add(a, b)
# returns <IndexedMaskedArray [101.1 None None 404.4 None] at 7f5e1acf0f98>
numpy.add(a, b).content
# returns array([101.1, 404.4])

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus, the above could have been a + b.

Methods

MaskedArray and its subclasses (BitMaskedArray and IndexedMaskedArray) have the following methods:

  • boolmask(maskedwhen=None): return the mask as boolean bytes. If maskedwhen is None, use the instance’s maskedwhen. Otherwise, override it. (IndexedMaskedArray.boolmask has a default maskedwhen of True.)

  • indexed(): convert to an IndexedMaskedArray.

BitMaskedArray

A BitMaskedArray is defined by two arrays, a boolean maskedwhen, a boolean lsborder, and a shape parameter maskshape. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • mask: basic array with exactly one dimension; will be viewed as BITMASKTYPE.

  • content: any array (default is a basic array of DEFAULTTYPE).

  • maskedwhen: boolean; same meaning as in MaskedArray.

  • lsborder: boolean; if True, bits in mask are interpreted in LSB (least significant bit) order; if False, bits in mask are interpreted in MSB (most significant bit) order.

  • maskshape: None, a non-negative integer, or a tuple of positive integers (first may be zero); the sizes of the logical mask dimensions. If an integer, maskshape will be converted to (maskshape,). If None (the default), the maskshape will be assumed to be (len(content),). A value of None is persistent, so an unspecified maskshape scales with changes in content.

The whole-array validity conditions are:

  • The length of the BitMaskedArray must be less than or equal to the content length.

  • The length of the mask must be greater than or equal to 8 times the length of the BitMaskArray.

The length of the BitMaskedArray depends on maskshape: if None, the length is the content length. Otherwise, the length is maskshape[0].

Methods

In addition to methods defined in MaskedArray, a BitMaskedArray has the following static methods:

  • BitMaskedArray.bit2bool(bitmask, lsborder=False): converts one boolean per bit into one boolean per byte with a specified lsborder.

  • BitMaskedArray.bool2bit(boolmask, lsborder=False): converts one boolean per byte into one boolean per bit with a specified lsborder.

A BitMaskedArray may be created from one of the following alternate constructors.

BitMaskedArray.fromboolmask(mask, content, maskedwhen=True, lsborder=True, maskshape=None)

  • mask: one boolean per byte array; converted to one boolean per bit with BitMaskedArray.bool2bit(mask, lsborder=lsborder).

  • content: same as primary constructor.

  • maskedwhen: same as primary constructor.

  • lsborder: same as primary constructor.

  • maskshape: same as primary constructor.

IndexedMaskedArray

An IndexedMaskedArray is defined by two arrays. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • mask: a basic array of integer dtype (default is INDEXTYPE) with at least one dimension.

  • content: any array (default is a basic array of DEFAULTTYPE).

The whole-array validity conditions are:

  • maximum of mask (if non-negative) must be less than the content length.

The length of the IndexedMaskedArray is the length of the mask.

Indirection

Most programming environments have a concept of a “pointer” or “reference” that allows one object to be logically nested within another without being nested in the memory layout. The referenced object may be anywhere in memory and might not conform to the structure required of its type (depending on how strictly the language maintains type-safety). Completely general pointers cannot be emulated with arrays unless the entirety of a program’s memory were put into a single array. However, a limited form of indirection can be implemented through arrays of indexes.

As described in the types section, awkward-array allows the same data to appear in multiple parts of the data structure or even to contain themselves. In Python, awkward-arrays are Python instances whose members can be reassigned after construction, so nothing prevents an array from appearing in multiple parts of a structure or from containing itself.

To facilitate this kind of indirection, the IndexedArray class represents a delayed gather operation: it contains an array of indexes and a content array: extraction, slicing, masking, and gathering are filtered through the indexes before selecting contents. Its content could be itself, allowing the creation of graphs, though a JaggedArray or UnionArray in between would be needed to keep the graph finite.

IndexedArray acts as a bound for bounded pointers: part of a data structure with IndexedArray type can point to any element of the IndexedArray’s content. To bind pointers to more than one pool, combine them with UnionArray.

In a sense, a SparseArray is the opposite of an IndexedArray. A SparseArray contains logical indexes where the contents are not zero (or some other default) and content for each of those indexes, known as coordinate format (COO). Whereas logical element i of an IndexedArray is at content index index[i], content element j of a SparseArray is at logical index index[j]. An IndexedArray applies its index array as a function to obtain elements, a SparseArray inverts its index array as a function to obtain elements.

Since SparseArray must invert its index with every extraction, the index should be monatonically increasing (sorted). If a set of (index, content) pairs are known, they could be loaded into a SparseArray like this:

index, content     # coordinates as two equal-length arrays
order = numpy.argsort(index)
awkward.SparseArray(length, index[order], content[order])

IndexedArray and SparseArray both have the data type of their content — they are invisible at the type level, providing low-level features.

IndexedArray

An IndexedArray is defined by two arrays. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • index: basic array of integer dtype (default is INDEXTYPE) with at least one dimensions and all non-negative values.

  • content: any array (default is a basic array of DEFAULTTYPE).

  • dictencoding: boolean (default is False). If True, equality tests (== and != or numpy.equal and numpy.not_equal) do not propagate through to the content, but apply at the IndexedArray level and check for equality of the indexes. This makes IndexedArray usable as a dictionary encoding for categorical data.

The whole-array validity conditions are:

  • The maximum of index must be less than the length of content.

The length of an IndexedArray is the length of the index array.

Get-item behavior

When an indexed array myarray is passed a selection in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections are passed down to a nested Table, if it exists.

For example,

myarray = awkward.IndexedArray([2, 2, 1, 4], [0.0, 1.1, 2.2, 3.3, 4.4, 5.5])
myarray
# returns <IndexedArray [2.2 2.2 1.1 4.4] at 772e306077f0>
myarray[2]
# returns 1.1
myarray[2:]
# returns array([1.1, 4.4])

Here is another example, this one using a cyclic reference to build arbitrary depth trees.

myarray = awkward.IndexedArray([0],
              awkward.UnionArray.fromtags([1, 0, 1, 0, 1, 0, 0, 1], [
                  numpy.array([1.1, 2.2, 3.3, 4.4]),
                  awkward.JaggedArray([1, 3, 5, 8], [3, 5, 8, 8], [])]))   # the [] will be replaced
myarray.content.contents[1].content = myarray.content
myarray
# returns <IndexedArray [[1.1 [2.2 [3.3 4.4 []]]]] at 746bf6c422b0>
myarray[0, 1]
# returns <UnionArray [2.2 [3.3 4.4 []]] at 746bf6c422e8>
myarray[0, 1, 1]
# returns <UnionArray [3.3 4.4 []] at 746bf6c42390>
myarray[0, 1, 1, 2]
# returns array([], dtype=float64)

The depth of this tree is not a function of the depth of the IndexedArray of UnionArray of basic and JaggedArray that built it. The depth of this tree is a function of the values of the index array, the tags array, and the starts/stops arrays. This construction is a purely columnar tree of numbers and sub-trees.

If dictencoding is True, the equality tests (== and != or numpy.equal and numpy.not_equal) do not propagate through to the content, but apply at the IndexedArray level and check for equality of the indexes.

Mapped kernel behavior

If indexed arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the delayed gather is applied before computing the result. This even works in arbitrarily nested cases, like the last examples in the previous section.

numpy.sum(myarray, 10)
# returns <JaggedArray [[11.1 [12.2 [13.3 14.4 []]]]] at 746bf6c42400>

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been myarray + 10.

SparseArray

A SparseArray is defined by a shape, two arrays, and a default element. Below are their single-property validity conditions. The arrays may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • indexshape: non-negative integer or a tuple of positive integers (first may be zero); the sizes of the logical dimensions. If an integer, indexshape will be converted to (indexshape,).

  • index: basic array of integer dtype (default is INDEXTYPE) with exactly one dimension and all non-negative values. This array must be monatonically increasing (sorted).

  • content: any array (default is a basic array of DEFAULTTYPE).

  • default: None or any value. If None, an appropriate zero will be generated:

    • content.dtype.type(0) if content is a 1-dimensional basic array;

    • numpy.zeros(content.shape[1:], content.dtype) if content is a multidimensional basic array;

    • empty jagged array if content is a jagged array;

    • the masked value if content is a masked array;

    • None if content is an object array;

    • an empty string if content is a string array;

    • the first basic array zero if content is a union array; the first other type if the union has no basic arrays;

    • a Table.Row of defaults if content is a table;

    • a decision based on the content of any other type.

The whole-array validity conditions are:

  • flattened index length must be less than or equal to the content length.

The length of the SparseArray is determined purely by the indexshape.

Get-item behavior

When a sparse array myarray is passed a selection in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections are passed down to a nested Table, if it exists.

For example,

myarray = awkward.SparseArray(1000, [101, 102, 105, 800], [1.1, 2.2, 3.3, 4.4])
myarray
# returns <SparseArray [0.0 0.0 0.0 ... 0.0 0.0 0.0] at 7131e4b9a438>
myarray[100:106]
# returns <SparseArray [0.0 1.1 2.2 0.0 0.0 3.3] at 7131e4b9a518>
myarray[798:803]
# returns <SparseArray [0.0 0.0 4.4 0.0 0.0] at 7131e4b9a550>

Mapped kernel behavior

If sparse arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is computed for all non-default values and separately for the default value, blending the results as a UnionArray.

For example (reusing myarray from the previous section),

numpy.add(myarray, 10)[100:106]
# returns <UnionArray [10.0 11.1 12.2, 10.0 10.0 13.3] at 746bf6c41800>

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been (myarray + 10)[100:106].

Helper functions

The awkward.array.indexed submodule may define helper functions, such as the following.

  • invert(permutation): returns inverse such that inverse[permutation] == numpy.arange(len(permutation)) is the identity. (If permutation contains all values from 0 to len(permutation) - 1, it is also the case that permutation[inverse] == numpy.arange(len(permutation)).) If not all values in permutation are distinct, this function raises an error.

Opaque objects

The array types defined above are sufficient to create rich data types — most of the types expected in a general programming environment. With columnar layouts in memory, they take a minimum of space and regular operations can be applied on them very quickly. However, all of these are awkward-array types: only Numpy ufuncs and Python get-item know how to operate on them. Situations will arise in which types must satisfy third-party constraints.

Data structures built by combining awkward-arrays are constructive (built by construction), instances of other types are opaque (not known to the awkward-array library). To emulate an array of opaque objects, we wrap it in an ObjectArray that applies a function to an element i to generate the object at i. The object must be a pure function of the data at element i and not maintain long-lived state.

Get-item selections and mapped kernels perform vectorized operations across all or much of the array, and if the object type has methods, users may want to apply the methods as vectorized operations as well. Instantiating all elements in the array and invoking the method on all of them misses the point (one might as well use a Python list or a Numpy object array), so there is an alternate way to apply them: as vectorized operations on the data used to generate the objects.

Here is a motivating example: a Table of floating point "x" and "y" columns is wrapped in an ObjectArray with a Point constructor to effectively make an array of user-defined Point objects. Point instances have an angle method the computes math.atan2(self.y, self.x). Users want to compute the angle of all values in the array without constructing Point for each. We therefore add a method angle to ObjectArray that computes numpy.arctan2(self["x"], self["y"]).

These methods are added with a mix-in facility that accepts any class containing pure-function methods (no persistent state) and has no init method. This is where different languages will put the most constraint on what can be done. Mix-ins are equivalent to Java’s Interfaces, but in a statically compiled language, methods can’t be added at runtime. In Java in particular, classes can be created from mix-ins in a nested ClassContext, but methods from these runtime types can’t be used in the main ClassContext code because it has already been type-checked. Code that uses the new methods must be compiled after the mix-ins, which means that it must be compiled on the fly. In C++, a just-in-time compiler like Cling would be needed.

A library may be called compliant with awkward-array if it lacks the ability to add mix-in methods.

An important use of ObjectArray and mix-in methods is StringArray, which implements an array of strings as a JaggedArray of CHARTYPE, generating str or bytes objects upon extraction. It is important (for users) that the objects drawn from this array have the native string type of whichever language they’re using. It’s also important to have some vectorized methods, like dropping the last character of all strings (which can actually be a shift to the JaggedArray’s stops array). StringArray has its mix-in methods built-in, so it does not suffer the dynamic vs. static issue described above.

Although Numpy can store strings in arrays, its rectangular model requires strings to be padded to the length of the longest string in an array. StringArray takes advantage of JaggedArray’s efficient encoding of variable-length contents to store variable-length strings.

Mix-in Methods

For a class to be eligible as a mix-in, it must not have an init method and must not modify self in any of its methods. Mix-ins can be added to a class by inheritance or to an instance (in Python) by changing an object’s class attribute. Convenience functions are provided in Methods, which is a container of static methods:

  • mixin(methods, awkwardtype): given a methods class (the mix-ins) and an awkwardtype (the awkward-array class object, like JaggedArray or ObjectArray), this returns an array class object with the methods added. This class object can be constructed like the corresponding awkward-array, or it may be assigned to an existing instance’s class attribute.

  • maybemixin(samples, awkwardtype): given a samples object (an array that might have mix-ins) or list (arrays that might have mix-ins) and an awkwardtype (the awkward-array class object, like JaggedArray or ObjectArray), this returns an array class object with any mix-ins any of the samples might have (union of all mix-in methods, in Python subclassing order). It is used to transfer mix-in methods from one array to another.

Mix-in methods are automatically transferred in the following situations:

  1. When processing a Numpy ufunc (or equivalent mapped kernel), which includes unary and binary operations like + and -, all mix-in methods of the arguments are transferred to the output.

  2. When selecting a column from a Table, including selections through a nested contents (e.g. jaggedtable["x"]), the mix-in methods of the table column apply to the output, but the mix-in methods of the original container (e.g. jaggedtable) do not apply.

  3. When slicing, masking, or gathering through an array’s get-item (but not extracting!), the array’s mix-ins are retained in the output.

In all other operations, such as reductions and other methods, mix-ins are not carried through.

ObjectArray

An ObjectArray is defined by an array and a generator function with arguments. Below are their single-property validity conditions. The array may be generated from any Python iterable, with the default type chosen in the case of an empty iterable.

  • content: any array (default is a basic array of DEFAULTTYPE).

  • generator: function that produces object i from content[i].

  • args: a tuple of constant positional arguments to pass to generator. If not a tuple, it will be converted to (args,).

  • kwargs: a dict of constant keyword arguments to pass to generator. If not a dict, an error will be raised. The given dict is shallowly copied to avoid referencing issues.

  • dims: a positive integer (default is 1); the number of dimensions in the ObjectArray.

The whole-array validity conditions are:

  • dims must be less than or equal to len(content.shape).

The length of the ObjectArray is the length of content, and the shape of the ObjectArray is content.shape[:dims].

Get-item behavior

When an object array myarray is passed a selection in square brackets, it obeys the usual rules for all operations except extraction: a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray performs masking, and a 1-dimensional list or array of integers performs a gather operation. An integer, however, extracts from content and calls

generator(content[i], *args, **kwargs)

on the result to return an output. If dims > 1, the first dims - 1 elements of a given tuple are passed through content (so that an ObjectArray may be multidimensional) and then element dims - 1 of the tuple is run through the generator function. Any remaining elements of a given tuple are applied to the output of that generator.

For example,

class Point(object):
    def __init__(self, row):
        self.x, self.y = row["x"], row["y"]
    def __repr__(self):
        return "<Point {0} {1}>".format(self.x, self.y)

myarray = awkward.ObjectArray(awkward.Table(x=[1.1, 2.2, 3.3], y=[10, 20, 30]), Point)
myarray
# returns <ObjectArray [<Point 1.1 10> <Point 2.2 20> <Point 3.3 30>] at 7779705f4860>
myarray[1:]
# returns <ObjectArray [<Point 2.2 20> <Point 3.3 30>] at 7779705f49b0>
myarray[1]
# returns <Point 2.2 20>
myarray[1].y
# returns 20

Mapped kernel behavior

If object arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is computed on the contents and the output is re-wrapped as object arrays. This might not be the intended semantics for the objects; if so, overload them with mix-in methods. (The mix-in should define array_ufunc, described in the Numpy docs and as a NEP.)

Using the class from the previous example,

a = awkward.ObjectArray(awkward.Table(x=[1.1, 2.2, 3.3], y=[10, 20, 30]), Point)
b = awkward.ObjectArray(awkward.Table(x=[10, 20, 30], y=[100, 100, 100]), Point)
numpy.add(a, b)
# returns <ObjectArray [<Point 11.1 110> <Point 22.2 120> <Point 33.3 130>] at 7aea8ce5a358>

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been a + b.

StringArray

A StringArray is an ObjectArray with awkward.array.objects.StringMethods mix-ins. Its content is an internal JaggedArray and it accepts JaggedArray constructors. Its primary constructor parameters are:

  • starts: same as JaggedArray.starts except that it will apply to byte positions in `content.

  • stops: same as JaggedArray.stops except that it will apply to byte positions in

  • content: same as JaggedArray.content except that it will be viewed as CHARTYPE.

  • encoding: None (for bytes) or an encoding name (for str). Default is "utf-8". This property must be assigned with None or an encoding name but its value is None or a decoder function from codecs.getdecoder. (If the encoding name is not recognized, an error is raised.)

A StringArray has the same whole-array validity conditions as JaggedArray.

The length and shape of StringArray are the length and shape of starts.

StringArray has the same alternate constructors as JaggedArray: fromiter, fromoffsets, fromcounts, fromparents, fromuniques, and fromjagged, except that the content is always required to be or interpreted as CHARTYPE. StringArray additionally has the following constructors:

  • StringArray.fromstr(shape, string): duplicates a single str or bytes object to fill an array with a given shape (may be a non-negative integer).

  • StringArray.fromnumpy(array): converts a Numpy string array into a StringArray.

As an ObjectArray with an implicit generator of awkward.array.objects.tostring an implicit args of (encoding,), and an implicit dims of len(starts.shape), a StringArray returns a bytes or str for each item.

All Numpy ufuncs (or equivalent mapped functions) apply mathematical operations on the characters of the strings as though they were uint8 integers, except for equality tests (== and != or numpy.equal and numpy.not_equal), which are overloaded in awkward.array.objects.StringMethods to compute string equality.

Non-contiguousness

Many array sources are non-contiguous, usually so that they can be read in releatively small, memory-friendly chunks (e.g. ROOT baskets or Parquet pages). However, a basic array library like Numpy expects its arrays to be fully contiguous in memory, and that can usually only be achieved by copying data.

However, just as we wrap arrays in classes to give them new logical structure, we can wrap a sequence of arrays as a ChunkedArray to view it as though it were a concatenated version of those arrays. The arrays in the sequence all need to have the same high-level type, but they don’t all need to have the same low-level structure. Some may be basic arrays and others IndexedArrays to correspond to pages that alternate between a simple encoding and a dictionary encoding. The high-level type of the ChunkedArray is the same as the high-level type of its chunks.

To extract an element at index i, it is necessary to know the length of all chunks up to and including the one in which index i resides, but getting this information might be an expensive operation. Therefore, ChunkedArray does not require this information up-front, but requests it and retains it as higher indexes are requested. Its string representations (str and repr in Python) only show the first few elements and not the last if not all of the counts are known.

A non-contiguous array interface makes it possible to efficiently append rows to an array. Instead of copying a whole array into a larger allocation with each append, we can allocate a chunk, fill it by writing to it and increasing its “end” pointer, then allocate a new chunk when it is full. Since we can address non-contiguous data as a single array, we never have to copy partial results to concatenate. AppendableArray is an array with appendable rows, and is one of the only two mutable array types in awkward-arrays: AppendableArray can add new rows and Table can add, overwrite, and remove columns.

ChunkedArray

A ChunkedArray is defined by a list of chunks (arrays) and a list of counts (non-negative integers). Below are their single-property validity conditions. The arrays in chunks may be generated from any Python iterable, with default types chosen in the case of empty iterables.

  • chunks: a Python list of any array (defaults are basic arrays of DEFAULTTYPE).

  • counts: a Python list of non-negative integers. Default is [].

The whole-array validity conditions are:

  • chunks length must be greater than or equal to counts length.

  • Each count (non-negative integer in counts) must be equal to the length of the corresponding chunk (item in chunks).

  • All non-empty chunks must have the same high-level type as the first non-empty chunk.

ChunkedArray fills its counts as they become known, strictly from first to last. As a public property, these are visible to the user. ChunkedArray may also internally cache types as they become known (in any order), to avoid repeated queries.

A ChunkedArray has the following read-only properties and methods:

  • countsknown: True if counts has the same length as chunks; False otherwise.

  • typesknown: True if all types are internally cached; False otherwise. If a ChunkedArray does not cache types, this property may be omitted.

  • knowcounts(until=None): request and cache the lengths of chunks up to and not including until, or up to the end if until is None.

  • knowtype(at): request and cache the type of chunk at. If a ChunkedArray does not cache types, this property may be omitted.

  • global2chunkid(index, return_normalized=False): convert a ChunkedArray index to the chunk id in which it resides. (chunks[i] is the chunk at id i, etc.) The index may be an integer or a 1-dimensional array of integers for a gather operation. Negative indexes are normalized to count from the end of the ChunkedArray. If return_normalized is True, the output is a 2-tuple: the chunk id and the index normalized to count from the end of the ChunkedArray.

  • global2local(index): convert a ChunkedArray index to the corresponding chunk and its local index in the chunk. The index may be an integer or a 1-dimensional array of integers for a gather operation. If so, then the chunk output is a Numpy object array of chunks.

  • local2global(index, chunkid): convert a local chunk index and its chunk id to a global ChunkArray index. The index may be an integer or a 1-dimensional array of integers for a gather operation.

Get-item behavior

When a chunked array myarray is passed a selection in square brackets, it obeys the usual rules: an integer performs extraction, a slice performs slicing, a 1-dimensional list or array of booleans with the same length as myarray performs masking, and a 1-dimensional list or array of integers performs a gather operation. Tuples perform these operations in multiple dimensions. String selections are passed down to a nested Table, if it exists.

Touching elements can affect which counts are known and therefore the string representation of the array. For example,

myarray = awkward.ChunkedArray([[0, 1, 2], [], [3, 4], [5, 6, 7, 8], [9]])
myarray
# returns <ChunkedArray [0 1 2 3 4 5 6 ...] at 7f778daed7f0>
myarray[-1]
# returns 9
myarray
# returns <ChunkedArray [0 1 2 ... 7 8 9] at 7f778daed7f0>

Mapped kernel behavior

If chunked arrays are passed into a Numpy ufunc (or equivalent mapped kernel), the ufunc is computed iteratively on chunk sizes determined by the first chunked array argument, and the return value is a ChunkedArray with that structure.

For example (reusing myarray from the previous section),

numpy.add(myarray, 0.1)
# returns <ChunkedArray [0.1 1.1 2.1 3.1 4.1 5.1 6.1 ...] at 7f778daeda20>
numpy.add(myarray, 0.1).chunks
# returns [array([0.1, 1.1, 2.1]), array([], dtype=float64), array([3.1, 4.1]),
#          array([5.1, 6.1, 7.1, 8.1]), array([9.1])]

Unary and binary operators corresponding to mapped kernels should have the same behavior. Thus the above could have been myarray + 0.1.

AppendableArray

An AppendableArray is a ChunkedArray of primitive type that can be efficiently appended. Below are the single-property validity conditions. The arrays may be generated from any Python iterable, with dfault types chosen in the case of empty iterables.

  • chunkshape: positive integer or a tuple of positive integers defining the allocated shape of each chunk.

  • dtype: Numpy dtype of the content.

  • chunks: a Python list of basic arrays (default type DEFAULTTYPE).

The counts parameter is read-only and internally managed. In ChunkedArray, the counts must be exactly equal to the length of each chunk, but in AppendableArray, the last count is less than or equal to the length of the last chunk because not all of the allocated chunk may be filled with valid data. Uninitialized data may be visible to the user through chunks[-1], but not through get-item and mapped kernels on the AppendableArray itself.

The whole-array validity conditions are the same as for ChunkedArray, except that counts is not required to be equal to the length of each chunk.

AppendableArray has the following special methods:

  • append(value): add one value at the end of the array.

  • extend(values): add multiple values to the end of the array.

Laziness

Often, datasets are too large to entirely load into memory or too large to load up-front. Many data-loading libraries offer the ability to load parts of a file or dataset as needed. However, the decisions about when to load data, how much to load, and what to cache are system-dependent, and we might instead want them to be encoded in the array structure itself, so awkward-array has a VirtualArray class to represent an array that might or might not be in memory, but will be when asked.

Laziness and non-contiguousness are closely related. If a Table is too big to load but its columns of interest are not, then we may want a Table of VirtualArrays, so that each entire column is loaded when touched. However, if a single column is too big to load, then delaying that operation with a VirtualArray is not enough: we need a ChunkedArray of VirtualArrays to load chunks of rows at a time.

Laziness and caching are closely related. If all the data needed for a process is too large to hold in memory, then lazily loading each section and keeping it forever is not enough: we need the loaded data to be evicted when we’re done with it. If the VirtualArray instance goes out of scope, then Python’s garbage collector does that automatically. If not, then the VirtualArray must let its loaded data be managed by a cache with explicit eviction rules.

Most cache implementations in Python have a dict-like interface. If it is process-bound, then transient keys based on the Python id of the VirtualArrays. If it is not, then permanent identifiers must be assigned somehow.

If absolutely no caching is desired, then a Python MutableMapping with a do-nothing setitem would act as an immediately forgetful cache (with transient keys).

A Dask delayed array is the equivalent of a ChunkedArray of VirtualArrays, for which all of the chunked array’s counts are known.

VirtualArray

A VirtualArray is defined by a generating function, not any arrays. Below are the single-property validity conditions for all of its primary constructor arguments.

  • generator: a callable that produces the array. It must accept arguments as given by args and kwargs as generator(*args, **kwargs).

  • args (default ()): a tuple of arguments for the generator. If not a tuple, it will be converted to (args,).

  • kwargs (default {}): a dict of keyword arguments for the generator. If not a dict, an error will be raised. The given dict is shallowly copied to avoid referencing issues.

  • cache (default None): None for no cache or a dict-like object to use as a cache.

  • persistentkey (default None): None to use transient keys in a cache or a string to use as a key in a persistent cache.

  • type (default None): None or high-level type of the array to use before materializing it. If None, any query that requires type knowledge, such as asking for the length of the array, would cause the array to be materialized.

  • persistvirtual (default True): if True, persist this object as a virtual array, meaning that its data are not stored in the serialized form. If the VirtualArray depends on the existence of a file at a given path, for instance, the serialized form can’t be deserialized on a system without that file at that path. If False, persist this object as a concrete array, so that everything needed to reconstruct the data is stored in the serialized form.

There are no whole-array validity conditions in the normal sense, but if the type parameter is not None and the materialized array has a different type, an error is raised at that time.

VirtualArray has the following read-only properties and methods:

  • ismaterialized: True if the array has been loaded and False if it has not.

  • materialize(): cause the array to be loaded.

If type is None, then attempts to get the VirtualArray length, type, dtype, shape, etc. will cause the array to be materialized. In any case, an attempt to get-item or use the array in a Numpy ufunc (or equivalent mapped kernel) will cause the array to be materialized.

If cache is None, then the materialized array is internally cached in the VirtualArray object itself. To delete the array, it would be necessary to delete the VirtualArray.

If cache is not None and persistentkey is None, then the array is placed in the cache and a VirtualArray.TransientKey is used as the key. The transient key is guaranteed to be globally unique in the Python process as long as the VirtualArray exists. If the VirtualArray is deleted, its del method attempts to delete its transient key from the cache because its global uniqueness can no longer be guaranteed. However, this is fragile because the cache might have been changed for another cache, the del method might not be called before another Python object uses the VirtualArray’s Python id, etc. Generally, transient keys should be used when the VirtualArray objects are known to be long-lived. (If they are short-lived, setting cache to None and letting the Python garbage collector manage eviction would be a better policy.) If the cache only accepts strings as keys, the VirtualArray.TransientKey has a unique str representation.

If cache is not None and persistentkey is not None, then persistentkey will be used as the key for the cache. The burden of ensuring uniqueness is on the user, and the user will have to decide whether the key needs to be process-unique, machine-unique, or unique in some distributed sense.

VirtualArray maintains an internal list of columns added, overwritten, or deleted to or from any internal Tables. If the generated array is ever lost due to cache eviction and needs to be regenerated, these modifications will be replayed so that the apparent content maintains its state. Also, if persistvirtual is True and the generated array is not written to a serialized form, the modifications are written to the serialized form, and will be replayed when reconstructed from that serialized form.

Row-wise to columnar conversion

An awkward-array library should provide a general facility for creating columnar arrays from row-wise data. This would often be the first step before processing. In a statically typed language, the array structure can be fully determined before filling, but in a dynamically typed language, the array types and nesting structure would have to be discovered while iterating over the data.

The interface has two entry points:

  • fromiter(iterable, **options): iterate over data and return an array representing all of the data in a columnar form. In a statically typed language, the type structure would have to be provided somehow, and this function would raise an error if the data do not conform to that type (if possible).

  • fromiterchunks(iterable, chunksize, **options): iterate over data and yield arrays of length chunksize or less for the last array. This function never yields an empty array, but may yield no arrays. If types are dynamically discovered, this function raises an error if a chunk’s type differs from the first chunk. (The chunksize must be chosen large enough to fully discover the type in the first chunk.) In a statically typed language, the same type considerations apply as above.

The following options are recognized:

  • dictencoding: boolean or a function returning boolean. If True, all arrays of bytes/strings are dictionary encoded as an IndexedArray of StringArray. If False, all arrays of bytes/strings are simply StringArray. If a function, this function is called on the list of bytes/strings to determine if it should be dictionary encoded.

Types derived from the data resolve ambiguities by satisfying the following rules.

  1. Boolean types are distinct from numbers, but mixed numeric types are resolved in favor of the most general number at a level of the hierarchy. (That is, a single floating point value will make all integers at the same nesting depth floating point.) Booleans and numbers at the same level of hierarchy would be represented by a UnionArray.

  2. In Python (which makes a distinction between raw bytes and strings with an encoding), bytes and strings are different types. Bytes and strings at the same level of hierarchy would be represented by a UnionArray.

  3. All lists are presumed to be variable-length (JaggedArray).

  4. A mapping type (Python dict) is represented as a Table, and mappings with different sets of field names are considered distinct. Mappings with the same field names but different field types are not distinct: they are Tables with some UnionArray columns. Missing fields are different from fields with None values. Mappings must have string-typed keys: other types are not supported.

  5. An empty mapping, {}, is considered identical to None.

  6. Table column names are sorted alphabetically.

  7. An object type may be represented as an ObjectArray or it may not be supported. The Numpy-only implementation generates ObjectArrays from Python class instances and from namedtuples.

  8. Optional and sum types are in canonical form: MaskedArrays are not nested directly within MaskedArrays and UnionArrays are not nested directly within UnionArrays. If a given level of hierarchy is both masked and heterogeneous, the MaskArray is outside the UnionArray and none of the union array’s contents are masked.

  9. Tables, ObjectArrays, and UnionArrays are masked with IndexedMaskedArray, while all others are masked with MaskedArray.

  10. If a type at some level of hierarchy cannot be determined (all lists at that level are empty), then the type is taken to be DEFAULTTYPE.

Serialization

The purpose of awkward-array is to perform computations on complex data with an array programming idiom, not as a storage or interchange format to rival ROOT, Parquet, or Arrow. However, since the bulk data consist entirely of basic (Numpy) arrays, any storage format that can store a set of named arrays and the awkward metadata can store one or more awkward-arrays.

We define a protocol to store any awkward-array structure in any named array container so that users have a way to stash and share intermediate results and, hopefully, to communicate between awkward-array implementations. The definition consists of a JSON format with as little reference to Python specifics as possible, so that implementations of awkward-array in other languages would be able to share data. This sharing could be through pointer locations in a process, shared memory on a single operating system, files on disk, or a distributed network.

This persistence protocol does not define a method for storing named arrays. We assume that an existing mechanism, such as ROOT, Parquet, or Arrow, or ZIP, HDF5, named files, or an object store manage that. For complete generality, the “named arrays” do not need to keep array metadata (the way HDF5 files do), but can store only named binary blobs (the way ZIP files do). We only fill this container with array data, assign names to them, and additionally provide a JSON object that can be used to reverse the process.

The awkward-array library in Python uses this serialization mechanism to pickle array objects, save them to and load them from ZIP files, and to use HDF5 as a read-write database of awkward arrays. ZIP files of awkward-array objects are identified with file extension .awkd.

JSON metadata format

The JSON object is a nested expression to be interpreted as a command, like a LISP S-expression. This provides the most flexibility for backward and forward compatibility. For security, deserialization takes a whitelist of allowed functions: maliciously constructed data cannot execute arbitrary functions, only the ones in the whitelist. If a new feature is added that requires the execution of previously unforeseen functions, users may need to manually expand the whitelist for forward compatibility (or they may completely open the whitelist if they trust the data source).

A single JSON object represents a single awkward-array. Multiple awkward-arrays may be saved in the same namespace if they each have a different prefix. In general, it would be the user’s responsibility to ensure that prefixes don’t overlap, but the Numpy-only implementation of awkward-array checks for name collisions when writing to ZIP files and separates namespaces with Groups in HDF5 files. Prefixes can also be filesystem paths up to a directory delimiter — arrays would not overlap if they are saved to different directories.

Top level object

The top level of the JSON format defines the following fields. If a field has a default, it is optional. If it isn’t labeled with a default or as optional, it is required.

  • awkward (required, but for information purposes only): the version of the awkward-array library that wrote this object. Only the existence of this field is checked as a format signature.

  • doc (optional, for information purposes only): should be a string

  • metadata (optional, for information purposes only): any JSON structure

  • prefix (default is ""): the prefix at the beginning of the name of each binary blob.

  • schema: the deserialization expression, which recursively defines the awkward-array to build.

If additional fields are filled, the object is not considered invalid, though it risks conflicts with future versions of this specification.

Deserialization expressions

A deserialization expression is a JSON object whose type is determined by the presence of an identifying key. If additional fields beyond the ones described below are filled, the object is not considered invalid, though it risks conflicts with future versions of this specification. However, a deserialization expression must not include more than one identifying key in the same JSON object.

Every deserialization expression has an "id" field, which is a non-negative integer. These identifiers are used by "ref" to build cross-references and cyclic references into the arrays within the object. These identifiers are not guaranteed to be in order; a reference to an unrecognized identifier can be deserialized as a VirtualArray to break a cyclic dependency.

Each of the sections below is labeled with the identifying key first.

{"call": function, "args": [expressions...]}

The expression is a function to be called if its function specification matches a pattern in the whitelist.

  • call: function specification, described below.

  • args (default is []): list of [deserialization expressions] to use as positional arguments.

  • kwargs (default is {}): JSON object of [deserialization expressions] to use as keyword, argument pairs.

  • cacheable (default is false): if true, pass the cache as a cache keyword argument to the function.

  • whitelistable (default is false): if true, pass the whitelist as a whitelist keyword argument to the function.

Functions are specified by a list of at least two strings. The first is a fully qualified module path and rest are objects or objects within objects in that module that terminate in a callable.

For example, ["awkward", "JaggedArray", "fromcounts"] is the fromcounts constructor of class JaggedArray in module awkward.

{"read": name}

The expression is a binary blob to read with a given name.

  • read: string; the name of the binary blob to read.

  • absolute (default false): if true, the name is used as-is, if false, the global prefix must be prepended before the name.

{"list": [expressions...]}

The expression is a list of [deserialization expressions].

{"tuple": [expressions...]}

The expression is a tuple of [deserialization expressions].

{"dict": {name: expression, ...}}

The expression is a dict of name (string), deserialization expression pairs.

{"pairs": [[name, expression], ...]}

The expression is a list of name (string), deserialization expression pairs. It is not a JSON object so that the order is maintained.

{"dtype": dtype}

The expression is a Numpy dtype, expressed as JSON. Numpy distinguishes between Python tuples and lists, so the following patterns must all be converted to tuples before passing the result to the numpy.dtype constructor:

  • [..., integer]

  • [..., [all integers]]

  • [string, ...]

{"function": function-specification}

The expression is a function object to pass as an argument, not to call. The function specification is the same as in the "call" type, though.

{"json": data}

The expression is purely expressed by data, an arbitrary JSON value.

{"python": data}

The expression is encoded in data in a way that only Python can decode. It is a base-64 encoding of a pickled object.

{"ref": id}

The expression is a reference to an array defined elsewhere in the object.

Persistence configuration and signatures

Whitelist specification

The function whitelist is globally defined in awkward.persist.whitelist but it can also be passed into deserialization functions manually. The format is a list of function specifiers with glob-style wildcards. Function specifiers, as described in the "call" type, are a fully qualified module name followed by a path of objects within objects leading to a callable, like

["awkward", "JaggedArray", "fromcounts"]

for the fromcounts constructor of the JaggedArray class in the awkward module. Whitelist specifiers allow fnmatch wildcards, like

["awkward", "*Array", "from*"]

to allow any array type’s non-primary constructor. A single string is promoted to a specifier and a single specifier is promoted to a list of specifiers, so "*" by itself is a valid whitelist for allowing any function to run (for trusted data).

A function name that satisfies any wildcard expression is allowed.

awkward.persist.topython should not be in a default whitelist because unpickling untrusted data can call arbitrary Python functions.

An awkward-array library’s default whitelist is not defined in this specification.

Compression policy

When serializing, users have the option to compress basic (Numpy) arrays. (When deserializing, whatever decompression functions are found are executed, if they are in the whitelist.) Compression has more value for some kinds of arrays than others, so the decision to compress or not compress is parameterized as in a policy.

The default policy is globally defined in awkward.persist.compression as a list of rules. Each rule has the following format:

  • minsize: minimum size in bytes; if the basic array is smaller than this size, it is not compressed by this rule.

  • types: list of item types; if the basic array’s item type is not a subclass of one of these types, it is not compressed by this rule.

  • contexts: fnmatch wildcard string or list of such strings; if the basic array’s context (what parameter it belongs to in an awkward-array) does not match any of these patterns, it is not compressed by this rule.

  • pair: 2-tuple of compression function, decompression function specifier. The compression function is a Python callable, which turns a buffer into compressed bytes. The decompression function is a tuple of strings naming the module and object where the function may be found (as in the "call" type). Whereas the compression function is needed right away, the decompression function need only be specified so that it can be called during deserialization. The compression and decompression functions should be strict inverses of one another, with no parameters needed except the buffer to compress or decompress.

A single pair is promoted to a rule and a single rule is promoted to a list of rules, so it would be sufficient to pass compression=(zlib.compress, ("zlib", "decompress")) to a serialization function. If the compression pair is in the awkward.persist.partner dict, only the compression function is needed: compression=zlib.compress.

Persistence functions

serialize(obj, storage, name=None, delimiter="-", suffix=None, schemasuffix=None, compression=compression, **kwargs)

serializes obj (an awkard-array) and puts all binary blobs into storage using names derived from a name prefix, separated by delimiter. Binary blobs optionally may have a suffix (such as ".raw") and the schema itself (also inserted into storage as a binary blob) may have a schemasuffix (such as ".json"). The compression option is described above. No return value.

deserialize(storage, name="", whitelist=whitelist, cache=None)

returns an awkward-array from storage using name as a prefix and an exact name for finding the schema. The whitelist option is described above. If a cache is passed, that cache is passed as an argument to every VirtualArray.

save(file, array, name=None, mode="a", **options)

Save an array (an awkward-array) into a file specified as a name, a path, or as a file-like object. If it is a name that does not end in .awkd, this suffix is appended. The mode is passed to the zipfile.ZipFile object; "a" means append to an existing file or create a file. If there are name conflicts, an erorr is raised before writing anything to the file. No return value.

load(file, **options)

Open file as a read-only dict-like object containing awkward-arrays. The arrays may be found by asking for the dict-like object’s keys and extracted with get-item.

hdf5(group, **options)

Interpret an HDF5 file or group (from the h5py library) as containing awkward-arrays, rather than arrays. Low-level binary blobs are hidden in favor of logical awkward arrays. This object can be written to or read from as a dict-like object with get-item, set-item, and del-item.