Skip to content

Commit

Permalink
docs: Add documentation for array types (#110)
Browse files Browse the repository at this point in the history
This commit adds documentation for the newly-added `array` types.

In particular, the following aspects were documented:
 - type syntax
 - literal syntax
 - element nullability
 - casting behavior
 - array functions
 - limitations
 - external format support
  • Loading branch information
iamlucaswolf authored Aug 7, 2024
1 parent a5efdc6 commit ae11197
Show file tree
Hide file tree
Showing 5 changed files with 219 additions and 5 deletions.
105 changes: 105 additions & 0 deletions website/docs/sql/datatype/array.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
# Array Type

An array is a sequential collection of elements packed into a single SQL value.
Arrays are useful to model dependent information with an inherent ordering in a single row – for example, time-series data from longitudinal studies, or embedding vectors from machine learning models.

Array Types are created using the following syntax:

```
array(<atomic type> [not null])
```

`<atomic type>` may be any of Hyper's [atomic data types](./index.md#atomic-types).
The optional `not null` specifier indicates whether array elements are allowed to be `null`.
For example, `array(integer not null)` denotes an array of non-null integer elements.

All array types in Hyper have the following properties:
- Arrays have variable length. Within a column, arrays are not required to have the same number of elements.
- Arrays are one-dimensional. Higher-dimensional objects (e.g., matrices) must be flattened explicitly.

For a comprehensive overview of supported operations on arrays, see the section on [Array Functions](../scalar_func/arrays.md).

## Creating Arrays

Arrays can be created in two ways:

- Using the type constructor syntax
```sql
> select array[1,2,3];
{1,2,3}

> select array['one','two','three'];
{one,two,three}
```

The constructor syntax consists of the keyword `array` followed by a comma-separated list of element SQL values surrounded by square brackets `[...]`.

Using this syntax, the array type will be inferred automatically.
If the given list of elements does not contain a `null` value, the [element type](#element-types-and-nullability) will be inferred as non-nullable.

- Using a [cast](../scalar_func/conversion.md) from string data
```sql
> select '{1,2,3}'::array(integer);
{1,2,3}

> select '{one,two,three}'::array(text);
{one,two,three}
```

An array string literal consists of a comma-separated list of element literals surrounded by curly braces `{...}`.
The element literal syntax is identical to that of the respective atomic type.

Note that for string array types (such as, e.g., `array(text)`), any upper- or lower-case variant of the element literal `null` will be parsed as a `null` value.
To specify the string `null`, the element literal must be escaped using double quotes, like so:
```sql
> select '{null, "null"}'::array(text)
{NULL,null} # A null element, followed by the string 'null'
```

## Element Types and Nullability

Hyper's arrays are _strongly typed_:
All elements must be of the same type – the array's _element type_.
The element type is a defining part of the array's overall type, meaning that `array(integer)` is a different type than `array(boolean)` or `array(text)`.
Arrays can be built from all [atomic types](./index.md) available in Hyper.

Part of an array's element type is the element's nullability.
For example, `array(smallint)` is different from `array(smallint not null)`.
Note that this is independent of the nullability of the array itself.
The following four options all represent different types in Hyper:

|Type|array nullable?|elements nullable?| possible values|
|---|---|---|---|
|`array(integer)`|||`{}`,`{1,2,3}`,`{1,2,null}`, `null`|
|`array(integer not null)`|||`{}`,`{1,2,3}`,`null`|
|`array(integer) not null`|||`{}`,`{1,2,3}`,`{1,2,null}`|
|`array(integer not null) not null`|||`{}`,`{1,2,3}`|

The inner nullability of an array type can be changed by casting, using the conventional [cast syntax](../scalar_func/conversion.md):

```sql
# nullable to non-nullable
> select ('{1,2,3}'::array(integer))::array(integer not null)
# non-nullable to nullable
> select ('{1,2,3}'::array(integer not null))::array(integer)
```

A cast from a non-nullable element type to its nullable counterpart always succeeds.
The reverse direction (nullable to non-nullable) succeeds if the array does not contain `null` elements; otherwise the cast results in an error.
Casts across element types (e.g. from `array(integer not null)` to `array(bigint not null)`) are currently not supported, the only exception being casts from and to string types.

:::info
Non-nullable element types use less memory and enable optimizations for certain array operations. Users are therefore advised to use the most "restrictive" element type possible, given the use case at hand.
:::

## Limitations

Arrays are subject to the following limitations:

- The size of an array is limited to 4GB. Hyper is not optimized for processing large arrays; performance may degrade before this limit is reached.
- Arrays cannot be nested.
- Arrays cannot be used as column types in non-temporary tables.

:::note
Also, see the restrictions regarding array support in [external formats](../external/formats.md).
:::
27 changes: 22 additions & 5 deletions website/docs/sql/datatype/index.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,15 @@
# Data Types

Hyper has a rich set of native data types available to users.
Hyper provides a rich set of native data types.

The table below shows all the built-in general-purpose
data types. Most of the alternative names listed in the "Aliases" column
are the names supported by Hyper for compatibility reasons with Postgres.
Generally, Hyper's type system can be divided into two buckets: _atomic_ types, which describe single values, and _composite_ types, which describe collections of values.
However, this distinction is made for educational purposes only; both kinds are equally supported, and there is no fundamental limitation applying to either category.

## Atomic Types

Atomic types comprise fundamental, general-purpose data types.
The following table lists all available atomic types.
Most of the alternative names listed in the "Aliases" column are supported for compatibility with PostgreSQL.

Name|Aliases|Description
---|---|---
Expand Down Expand Up @@ -34,11 +39,23 @@ Persisting 32-bit floating point values (e.g., type `REAL`) requires at least [d
Up until Hyper API release [0.0.18825](/docs/releases#0.0.18825) Hyper used 64-bit floating points for all float types (i.e., also for `REAL`).
:::

## Composite Types

Composite types are collections of multiple data items in a single SQL value.
They allow for dedicated [schema denormalization][schema-denormalization], which can be useful for specific domains, such as machine learning applications.

Links to detailed documentation:
### Array

An array is an ordered sequence of values.
Arrays in Hyper are strongly-typed and can be built from all supported atomic types.
See [Array Type](./array.md) for more details.

## Further Reading

```mdx-code-block
import DocCardList from '@theme/DocCardList';
<DocCardList />
```

[schema-denormalization]: https://en.wikipedia.org/wiki/Denormalization
3 changes: 3 additions & 0 deletions website/docs/sql/external/formats.md
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,9 @@ Format |`format` Option Value |Recognized File Extensions |Schema Inference? |De
[Apache Iceberg](#external-format-iceberg) |`'iceberg'` |Specified path must point to table directory |Yes |The [Apache Iceberg format](https://iceberg.apache.org/); version 1 and 2 are supported; version 3 is not supported
[Apache Arrow](#external-format-arrow) | `'arrowfile'`, `'arrowstream'` | `arrow`, `arrows` | No | The [Apache Arrow format](https://arrow.apache.org/) version 1.3 [with restrictions](#external-format-arrow)

:::caution
[Array types](../datatype/array.md) can only be read from, but not written to, Apache Iceberg and Apache Parquet.
:::
## Format Options

Format options allow customizing the way the external format is read and
Expand Down
87 changes: 87 additions & 0 deletions website/docs/sql/scalar_func/arrays.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Array Functions

This section describes functions and operators to inspect and transform [`array` values](../datatype/array.md). In the following, `array(T)` denotes an array type with element type `T`.
The element type is omitted when irrelevant.


## Access & Inspection
The following functions yield access to array elements or metadata.
These operations are compatible with all (nullable and non-nullable) element types.

Signature|Description|Example
---|---|---
<code>array(T)**[**int**]**</code> → `T`| Returns the n-th element of the array (1-indexed). | `(array[1,2,3])[1]``1`
<code>array(T)**[**int**:**int**]**</code> → `T` | Returns the subarray within the given boundes (1-indexed, inclusive). |`(array[1,2,3])[2:3]``{2,3}` |
<code>**array_length(**array**)**</code> → `int` | Returns the length of the array. | `array_length(array[1,2,3])``3`
<code>**array_to_string(**array, text [, text]**)**</code>| Converts the array into a textual representation, with the given element separator and (optional) null indicator. | `array_to_string(array[1,2,3], ';')``1;2;3`<br/>`array_to_string(array[3,2,1,null], '⏰', '🎉')``3⏰2⏰1⏰🎉`

## Transformations
The following functions produce new array values from existing ones.
These operations are compatible with all (nullable and non-nullable) element types.

Signature|Description|Example
---|---|---
<code>**array_prepend(**T,array(T)**)**</code> → `array(T)`<br/>or<br/><code>T **\|\|** array(T)</code> → `array(T)`| Inserts a value at the front of the array.<br/>If `null` is prepended to an array with non-nullable element type, the resulting element type is nullable. | `array_prepend(1, array[2,3,4])``array[1,2,3,4]`<br/>`array_prepend(null, [2,3,4])``array[null,2,3,4]`
<code>**array_append(**array(T), T**)**</code> → `array(T)`<br/>or<br/><code>array(T) **\|\|** T</code> → `array(T)`| Inserts a value at the end of the array.<br/>If `null` is appended to an array with non-nullable element type, the resulting element type is nullable. | `array_append(array[1,2,3], 4)``array[1,2,3,4]`<br/>`array_append(array[1,2,3], null)``array[1,2,3,null]`
<code>**array_cat(**array(T), array(T)**)**</code> → `array(T)`<br/>or<br/><code>array(T) **\|\|** array(T)</code> → `array(T)`| Concatenates two arrays.<br/>The resulting element type is non-nullable if and only if both input element types are non-nullable. | `array_cat(array[1,2], array[3,4])``array[1,2,3,4]`<br/>`array_cat(array[1,null], array[3,4])``array[1,null,3,4]`<br/>`array_cat(array[1,2], array[null,4])``array[1,2,null,4]`


## Inner Products
Hyper offers high-performance implementations of common inner products in vector spaces. These functions operate on two input vectors (represented as arrays) and produce a scalar value of type `double precision`.

Contrary to other array operations, vector functions only apply to a certain set of arrays:
- the element type must be `real` or `double precision`
- both arrays must have the same element type (but may have different element nullability)
- both arrays must have the same length

If any of the input arrays contains a `null` element, or is itself `null`, the result of an inner product will be `null`.
Passing arrays of different lengths or incompatible types will result in an error.

In the following table, signatures and examples are abbreviated for clarity. `vec` denotes a suitable array type (i.e., `array(real)` or `array(double precision not null)`). It is implied that both array arguments have the same element type. Similarly, the `{1.0, 2.0, 3.0}` syntax in the example column represents a suitable array value (e.g., `'{1.0, 2.0, 3.0}'::array(real not null)`).

|Signature|Description|Example
|---|---|---|
|<code>**dot_product(**vec, vec**)**</code> → `double precision`| Computes the conventional [dot product][dot-product] between two vectors. | <code>dot_product({1.0, 2.0, 3.0}, {-1.0, 2.0, -3.0})</code> → `-6.0`<br/><code>dot_product({1.0, null, 3.0}, {-1.0, 2.0, -3.0})</code> → `null`
|<code>**cosine_similarity(**vec, vec**)**</code> → `double precision`| Computes [cosine similarity][cosine-similarity] between two vectors. | <code>cosine_similarity({1.0, 2.0, 3.0}, {-1.0, 2.0, -3.0})</code> → `-0.42857...`<br/><code>cosine_similarity({1.0, 2.0, 3.0}, {null, 2.0, -3.0})</code> → `null`

:::tip
If possible, prefer arrays with non-nullable element types (i.e., `array(real not null)` or `array(double precision not null)`) when computing inner products.
This allows Hyper to skip element `null` checks, resulting in better performance.
:::

### Example Use Case: Vector Search

With inner products, vector search can be implemented directly in Hyper.

:::info
[Vector search][vector-space-model] is a search technique to find semantically similar items (called _documents_) in a given corpus.
Each document is represented as an _embedding vector_ in a high-dimensional vector space, e.g. using word-count statistics or machine learning models.
Two documents are considered similar if their embeddings are "close by" in the space &mdash; as measured by an inner product.
:::

For illustration, consider the problem of finding products similar to a search string in the product catalog of an e-commerce business.
To simplify the example, let's assume that the product catalog is already loaded into a temporary table `products` with the following columns:

- `description_vec` (type: `array(real not null)`) a description of the product, reprsented as a suitable embedding vector
- `name` (type: `text`) the name of the product

Further, let's assume that the search string has already been converted into a vector `{1.1, -0.2, 0.7, -0.3}`, using the same embedding model as the `description_vec` column.

Retrieving the top five most-similar products can be expressed in SQL as follows:

```sql
select
name,
dot_product('{1.1, -0.2, 0.7, -0.3}'::array(real not null), description_vec) as score
from products
limit 5
order by score descending
```

:::note
The embedding vector in this example has been chosen as four-dimensional to keep the query readable; realistic applications likely use much higher dimensionalities.
:::

[dot-product]: https://en.wikipedia.org/wiki/Dot_product
[cosine-similarity]: https://en.wikipedia.org/wiki/Cosine_similarity
[vector-space-model]: https://en.wikipedia.org/wiki/Vector_space_model
2 changes: 2 additions & 0 deletions website/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,7 @@ const sidebars = {
"sql/datatype/numeric",
"sql/datatype/datetime",
"sql/datatype/binary",
"sql/datatype/array"
],
},
{
Expand All @@ -96,6 +97,7 @@ const sidebars = {
"sql/scalar_func/string",
"sql/scalar_func/string_matching",
"sql/scalar_func/formatting",
"sql/scalar_func/arrays",
"sql/scalar_func/datetime",
"sql/scalar_func/geography",
],
Expand Down

0 comments on commit ae11197

Please sign in to comment.