Database: Add autoincrement, uniqueness, and sync-writes polyfills #28

Merged
merged 4 commits into from
Jun 24, 2024
3 changes: 3 additions & 0 deletions CHANGES.md
Expand Up @@ -6,6 +6,9 @@
- Added `CrateIdentifierPreparer`, in order to quote reserved words
like `object` properly, for example when used as column names.
- Fixed `CrateDialect.get_pk_constraint` to return `list` instead of `set` type
- Added re-usable patches and polyfills from application adapters.
New utilities: `patch_autoincrement_timestamp`, `refresh_after_dml`,
`check_uniqueness_factory`

## 2024/06/13 0.37.0
- Added support for CrateDB's [FLOAT_VECTOR] data type and its accompanying
Expand Down
1 change: 1 addition & 0 deletions docs/index-all.rst
Expand Up @@ -18,3 +18,4 @@ CrateDB SQLAlchemy dialect -- all pages
advanced-querying
inspection-reflection
dataframe
support
16 changes: 15 additions & 1 deletion docs/index.rst
Expand Up @@ -135,7 +135,7 @@ Load results into `pandas`_ DataFrame.
print(df)


Data types
Data Types
==========

The :ref:`DB API driver <crate-python:index>` and the SQLAlchemy dialect
Expand All @@ -150,6 +150,20 @@ extension types <using-extension-types>` documentation pages.

data-types

Support Utilities
=================

The package bundles a few support and utility functions that fill gaps you
may observe when working with CrateDB, compared with other databases.
Due to its distributed nature, CrateDB's behavior and features differ from those
found in other RDBMS.

.. toctree::
:maxdepth: 2

support


.. _examples:
.. _by-example:
Expand Down
47 changes: 43 additions & 4 deletions docs/overview.rst
@@ -1,9 +1,9 @@
.. _overview:
.. _using-sqlalchemy:

========
Overview
========
=================
Features Overview
=================

.. rubric:: Table of contents

Expand Down Expand Up @@ -300,15 +300,28 @@ would translate into the following declarative model:
>>> log.id
...

.. _auto-generated-identifiers:

Auto-generated primary key
Auto-generated identifiers
..........................

CrateDB does not provide traditional sequences or ``SERIAL`` data type support,
which enable automatically assigning incremental values when inserting records.
However, it offers server-side support by providing an SQL function to generate
random identifiers of ``STRING`` type, and client-side support for generating
``INTEGER``-based identifiers, when using the SQLAlchemy dialect.

.. _gen_random_text_uuid:

``gen_random_text_uuid``
~~~~~~~~~~~~~~~~~~~~~~~~

CrateDB 4.5.0 added the :ref:`gen_random_text_uuid() <crate-reference:scalar-gen_random_text_uuid>`
scalar function, which can also be used within an SQL DDL statement, in order to automatically
assign random identifiers to newly inserted records on the server side.

This makes it suitable for generating ``PRIMARY KEY`` values for SQLAlchemy models.
It works on SQLAlchemy-defined columns of type ``sa.String``.

A table schema like this

Expand All @@ -334,6 +347,32 @@ would translate into the following declarative model:
>>> item.id
...

.. _timestamp-autoincrement:

Timestamp-based Autoincrement
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By using SQLAlchemy's ``sa.func.now()``, you can assign automatically generated
identifiers to SQLAlchemy columns of types ``sa.BigInteger``, ``sa.DateTime``,
and ``sa.String``.

This emulates autoincrement / sequential ID behavior for designated columns, based
on assigning timestamps on record insertion.

>>> class Item(Base):
...     id = sa.Column("id", sa.BigInteger, default=func.now(), primary_key=True)
...     name = sa.Column("name", sa.String)

>>> item = Item(name="Foobar")
>>> session.add(item)
>>> session.commit()
>>> item.id
...

There is a support utility which emulates autoincrement / sequential ID
behavior for designated columns, based on assigning timestamps on record
insertion. See :ref:`support-autoincrement`.


.. _using-extension-types:

Expand Down
213 changes: 213 additions & 0 deletions docs/support.md
@@ -0,0 +1,213 @@
(support-features)=
(support-utilities)=
# Support Features

The package bundles a few support and utility functions that fill gaps you
may observe when working with CrateDB, a distributed OLAP database, which
lacks certain features usually found in traditional OLTP databases.

Some of the features outlined below are referred to as [polyfills]. They
emulate missing functionality, for example to satisfy compatibility
requirements of downstream frameworks or test suites. Use them where they
help, but be aware that some of them can seriously impact performance.

Other features include efficiency support utilities for 3rd-party frameworks,
which can be used to increase performance, mostly on INSERT operations.


(support-insert-bulk)=
## Bulk Support for pandas and Dask

:::{rubric} Background
:::
CrateDB's [](inv:crate-reference#http-bulk-ops) interface enables efficient
INSERT, UPDATE, and DELETE operations for batches of data. Each bulk
operation is executed as a single call on the database server.

:::{rubric} Utility
:::
The `insert_bulk` utility provides efficient bulk data transfers when using
dataframe libraries like pandas and Dask. The {ref}`dataframe` page covers
related topics in detail, such as choosing suitable chunk sizes and
concurrency settings.

:::{rubric} Synopsis
:::
Use `method=insert_bulk` on pandas' or Dask's `to_sql()` method.
```python
import sqlalchemy as sa
from sqlalchemy_cratedb.support import insert_bulk
from pueblo.testing.pandas import makeTimeDataFrame

# Create a pandas DataFrame, and connect to CrateDB.
df = makeTimeDataFrame(nper=42, freq="S")
engine = sa.create_engine("crate://")

# Insert content of DataFrame using batches of records.
df.to_sql(
    name="testdrive",
    con=engine,
    if_exists="replace",
    index=False,
    method=insert_bulk,
)
```
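
To illustrate the idea of batching behind bulk operations, here is a minimal
sketch that groups records into batches and builds one request payload per
batch. It assumes the `stmt` / `bulk_args` payload shape of CrateDB's HTTP
endpoint. The helper names `chunked` and `bulk_payloads`, the column names, and
the batch size are hypothetical, and this is not how `insert_bulk` is implemented.

```python
import json
from itertools import islice
from typing import Iterable, Iterator, List, Sequence


def chunked(records: Iterable[Sequence], size: int) -> Iterator[List[Sequence]]:
    """Yield successive batches of at most `size` records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def bulk_payloads(stmt: str, records: Iterable[Sequence], size: int) -> Iterator[str]:
    """Build one JSON payload per batch (assumed `stmt` / `bulk_args` shape)."""
    for batch in chunked(records, size):
        yield json.dumps({"stmt": stmt, "bulk_args": [list(row) for row in batch]})


# Hypothetical example data: five records, sent in batches of two.
records = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]
stmt = "INSERT INTO testdrive (id, name) VALUES (?, ?)"
payloads = list(bulk_payloads(stmt, records, size=2))
print(len(payloads))  # 3 batches: 2 + 2 + 1 records
```

With pandas, the batch size is usually controlled through the `chunksize`
argument of `to_sql()`, so each chunk can be sent as one bulk operation.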

(support-autoincrement)=
## Synthetic Autoincrement using Timestamps

:::{rubric} Background
:::
CrateDB does not provide traditional sequences or `SERIAL` data type support,
which enable automatically assigning incremental values when inserting records.


:::{rubric} Utility
:::
- The `patch_autoincrement_timestamp` utility emulates autoincrement /
sequential ID behavior for designated columns, based on assigning timestamps
on record insertion.
- It will simply assign `sa.func.now()` as a column `default` on the ORM model
column.
- It works on the SQLAlchemy column types `sa.BigInteger`, `sa.DateTime`,
and `sa.String`.
- You can use it if adjusting ORM models for your database adapter is not
an option.

:::{rubric} Synopsis
:::
After activating the patch, you can use `autoincrement=True` on column definitions.
```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy_cratedb.support import patch_autoincrement_timestamp

# Enable patch.
patch_autoincrement_timestamp()

# Define database schema.
Base = declarative_base()

class FooBar(Base):
    __tablename__ = "foobar"
    id = sa.Column(sa.DateTime, primary_key=True, autoincrement=True)
```

:::{warning}
CrateDB's [`TIMESTAMP`](inv:crate-reference#type-timestamp) data type provides
millisecond granularity. Records inserted within the same millisecond can
receive the same value, so consider the risk of collisions in high-traffic
environments.
:::
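
To get a feeling for the collision risk, the following sketch counts how many
records would share a millisecond-truncated timestamp. The function name
`count_collisions` and the sample insert times are hypothetical. The sketch
only illustrates the granularity argument and does not touch the database.

```python
from collections import Counter
from typing import Iterable


def count_collisions(timestamps_ns: Iterable[int]) -> int:
    """Count records whose millisecond-truncated timestamp is already taken."""
    per_millisecond = Counter(ts // 1_000_000 for ts in timestamps_ns)
    return sum(count - 1 for count in per_millisecond.values())


# Hypothetical insert times in nanoseconds.
# The first three fall within the same millisecond.
inserts_ns = [1_000_000_000, 1_000_400_000, 1_000_900_000, 1_002_000_000]
print(count_collisions(inserts_ns))  # 2
```

Feeding recorded insert times of a real workload into such a function gives a
rough estimate of how often timestamp-based identifiers would clash.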


(support-synthetic-refresh)=
## Synthetic Table REFRESH after DML

:::{rubric} Background
:::
CrateDB is [eventually consistent]. Data written by one statement is not
guaranteed to be visible to an immediately following SELECT statement on the
affected rows.

Data written to CrateDB is flushed periodically. The refresh interval is
1000 milliseconds by default, and can be changed. More details can be found in
the reference documentation about [table refreshing](inv:crate-reference#refresh_data).

There are situations where stronger consistency is required, for example when
needing to satisfy test suites of 3rd party frameworks, which usually do not
take such special behavior of CrateDB into consideration.

:::{rubric} Utility
:::
- The `refresh_after_dml` utility will configure an SQLAlchemy engine or session
to automatically invoke `REFRESH TABLE` statements after each DML
operation (INSERT, UPDATE, DELETE).
- Only relevant (dirty) entities / tables will be considered to be refreshed.

:::{rubric} Synopsis
:::
```python
import sqlalchemy as sa
from sqlalchemy_cratedb.support import refresh_after_dml

engine = sa.create_engine("crate://")
refresh_after_dml(engine)
```

```python
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
from sqlalchemy_cratedb.support import refresh_after_dml

engine = sa.create_engine("crate://")
session = sessionmaker(bind=engine)()
refresh_after_dml(session)
```

:::{warning}
Refreshing the table after each DML operation can cause serious performance
degradations, and should only be used on low-volume, low-traffic data,
when applicable, and if you know what you are doing.
:::
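
The "dirty tables only" behavior can be pictured with a small sketch that
turns the names of tables touched by a unit of work into `REFRESH TABLE`
statements, one per distinct table. The helper names `quote_identifier` and
`refresh_statements` are hypothetical, and the sketch makes no claim about how
`refresh_after_dml` collects table names or emits statements internally.

```python
from typing import Iterable, List


def quote_identifier(name: str) -> str:
    """Double-quote an SQL identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def refresh_statements(dirty_tables: Iterable[str]) -> List[str]:
    """Build one `REFRESH TABLE` statement per distinct table, in first-seen order."""
    distinct = dict.fromkeys(dirty_tables)
    return [f"REFRESH TABLE {quote_identifier(name)}" for name in distinct]


# Hypothetical table names touched by a unit of work; `foobar` appears twice.
statements = refresh_statements(["foobar", "testdrive", "foobar"])
print(statements)
```

Deduplicating the table names keeps the number of refresh calls proportional
to the number of affected tables, not to the number of modified records.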


(support-unique)=
## Synthetic UNIQUE Constraints

:::{rubric} Background
:::
CrateDB does not provide `UNIQUE` constraints in DDL statements. Because of its
distributed nature, supporting such a feature natively would cause expensive
database cluster operations, negating many benefits of using a database cluster
in the first place.

:::{rubric} Utility
:::
- The `check_uniqueness_factory` utility emulates "unique constraints"
functionality by querying the table for unique values before invoking
SQL `INSERT` operations.
- It uses SQLAlchemy [](inv:sa#orm_event_toplevel), more specifically
the [before_insert] mapper event.
- When the uniqueness constraint is violated, the adapter will raise a
corresponding exception.
```python
IntegrityError: DuplicateKeyException in table 'foobar' on constraint 'name'
```

:::{rubric} Synopsis
:::
```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base
from sqlalchemy.event import listen
from sqlalchemy_cratedb.support import check_uniqueness_factory

# Define database schema.
Base = declarative_base()

class FooBar(Base):
    __tablename__ = "foobar"
    id = sa.Column(sa.String, primary_key=True)
    name = sa.Column(sa.String)

# Add synthetic UNIQUE constraint on `name` column.
listen(FooBar, "before_insert", check_uniqueness_factory(FooBar, "name"))
```
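
The underlying "query before insert" technique can be sketched without
SQLAlchemy. The example below uses Python's built-in `sqlite3` module purely
as a stand-in database. The names `insert_unique` and `DuplicateValueError`
are hypothetical, and this is an illustration of the idea, not the
implementation of `check_uniqueness_factory`.

```python
import sqlite3


class DuplicateValueError(Exception):
    """Hypothetical exception, raised when the emulated constraint is violated."""


def insert_unique(conn: sqlite3.Connection, record_id: str, name: str) -> None:
    """Insert a record, unless a record with the same `name` already exists."""
    row = conn.execute(
        "SELECT COUNT(*) FROM foobar WHERE name = ?", (name,)
    ).fetchone()
    if row[0] > 0:
        raise DuplicateValueError(f"Duplicate value for 'name': {name!r}")
    conn.execute("INSERT INTO foobar (id, name) VALUES (?, ?)", (record_id, name))


# Stand-in database: an in-memory SQLite table without any UNIQUE constraint.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foobar (id TEXT PRIMARY KEY, name TEXT)")

insert_unique(conn, "1", "alice")
try:
    insert_unique(conn, "2", "alice")
    duplicate_rejected = False
except DuplicateValueError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

Because the check and the insert are two separate steps, concurrent writers
can both pass the check before either of them inserts. The emulation is
therefore best-effort and not an atomic guarantee.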

[before_insert]: https://docs.sqlalchemy.org/en/20/orm/events.html#sqlalchemy.orm.MapperEvents.before_insert

:::{note}
This feature will only work well if table data is consistent, which can be
ensured by invoking a `REFRESH TABLE` statement after any DML operation.
For conveniently enabling "always refresh", please refer to the documentation
section about [](#support-synthetic-refresh).
:::

:::{warning}
Querying the table before each INSERT operation can cause serious performance
degradations, and should only be used on low-volume, low-traffic data,
when applicable, and if you know what you are doing.
:::


[eventually consistent]: https://en.wikipedia.org/wiki/Eventual_consistency
[polyfills]: https://en.wikipedia.org/wiki/Polyfill_(programming)
13 changes: 13 additions & 0 deletions src/sqlalchemy_cratedb/support/__init__.py
@@ -0,0 +1,13 @@
from sqlalchemy_cratedb.support.pandas import insert_bulk
from sqlalchemy_cratedb.support.polyfill import check_uniqueness_factory, refresh_after_dml, \
    patch_autoincrement_timestamp
from sqlalchemy_cratedb.support.util import refresh_table, refresh_dirty

# `__all__` must contain names as strings, not the objects themselves.
__all__ = [
    "check_uniqueness_factory",
    "insert_bulk",
    "patch_autoincrement_timestamp",
    "refresh_after_dml",
    "refresh_dirty",
    "refresh_table",
]