Merge pull request #22 from bag-cnag/dev
Release 0.6
Neah-Ko authored Aug 8, 2024
2 parents 50bdc91 + 70a2632 commit 9b01b7f
Showing 43 changed files with 1,133 additions and 596 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -41,7 +41,7 @@ jobs:
- name: Run unit tests
run: |
cd src/tests/unit
pytest
python -m pytest
cd -
- name: Analysing the code with pylint
2 changes: 1 addition & 1 deletion docker/Dockerfile.biodm-test-runner
@@ -16,4 +16,4 @@ COPY ${TEST_DIR_PATH} /tests

WORKDIR /tests

ENTRYPOINT ["pytest"]
ENTRYPOINT ["python", "-m", "pytest"]
3 changes: 0 additions & 3 deletions docs/developer_manual/doc_endpoints.rst
@@ -37,9 +37,6 @@ decorators.
responses:
201:
description: Write Dataset resource.
examples: |
# TODO:
{"name": "instant_sc_1234", ""}
content:
application/json:
schema: DatasetSchema
16 changes: 16 additions & 0 deletions docs/developer_manual/permissions.rst
@@ -97,3 +97,19 @@ with a nested collection and its elements.

Those permissions will be taken into account when directly accessing ``/files`` API routes.


Strict composition
~~~~~~~~~~~~~~~~~~

Currently, ``BioDM`` assumes a strict composition pattern of resources for those permissions.
This allows them to be taken into account when directly accessing child resource routes,
as mentioned above.

Unfortunately, that also means that distributing permissions from two or more parent-level
resources is currently not tested and will most likely result in soft-locking those resources.

This may or may not be supported in a future version of the Core, depending on technical
feasibility.

If you wish to achieve something in that vein, it is for now advised to create an identical
resource under a different name.
22 changes: 15 additions & 7 deletions docs/developer_manual/s3conf.rst
@@ -19,17 +19,25 @@ The following variables have to be provided.
File management
----------------
To ensure bucket key uniqueness for uploaded files, the key gets prefixed by
``S3File.key_salt`` column. By default this is an ``uuid4`` but in case you would like to
manage this differently you could override this attribute in ``File`` class.
``S3File.key_salt`` column. By default this is a ``uuid4``.

In case you would like precise control over how your files are named on the bucket, this
can be done by overloading ``key_salt`` with a ``hybrid_property`` in the following way.

.. code-block:: python
   :caption: demo.py

   from sqlalchemy.ext.hybrid import hybrid_property

   class File(bd.components.S3File, bd.components.Base):
       ...
       @declared_attr
       @classmethod
       def key_salt(cls) -> Mapped[str]:
           # Replace lambda below by a personalized function.
           return Column(String(8), nullable=False, default=lambda: "myprefix")

       @hybrid_property
       async def key_salt(self) -> str:
           # Pop session, populated by S3Service just before asking for that attr.
           session = self.__dict__.pop('session')
           # Use session to fetch what you need.
           await session.refresh(self, ['dataset'])
           await session.refresh(self.dataset, ['project'])
           # Build your custom prefix.
           return f"{self.dataset.project.name}_{self.dataset.name}"
128 changes: 125 additions & 3 deletions docs/developer_manual/table_schema.rst
@@ -1,6 +1,128 @@
Designing Table and Schemas
Tables and Schemas
============================

This section provides some recommended practices in order to design your tables and schemas.
This section describes how ``BioDM`` leverages your ``Tables`` and ``Schemas`` in order to set
up resources. It contains some useful information for developers designing their own.

It is recommended to visit the documentation of both ``SQLAlchemy`` and ``Marshmallow`` beforehand.
Moreover, our ``example`` project also provides plenty of inspiration for this task.

Tables
------

In principle any valid ``SQLAlchemy`` table is accepted by ``BioDM``. Yet,
depending on how you configure it, it will adopt varying behaviours.

The SQLAlchemy 2.0 ORM API is a really nice and convenient piece of technology.
However, it does not natively support trees of entities (nested dictionaries).

To mitigate this problem, ``BioDM`` does a pass of parsing on schema validation results under the
hood, using Tables as literals in order to build the hierarchical tree of statements.
In a second pass, it inserts the whole tree in order, ensuring integrity.
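The two passes can be illustrated with a minimal, framework-agnostic sketch. Everything below (the function name, payload shape, and the ``id_``-prefix convention) is a hypothetical illustration, not BioDM's actual internals:

```python
# Illustrative sketch: separate a nested payload into parent-first
# "statements" so parents are inserted before children referencing them.

def flatten_parent_first(entity: str, data: dict, out: list) -> dict:
    """Recurse into nested dicts (parents) before appending the current row."""
    row = {}
    for key, value in data.items():
        if isinstance(value, dict):
            parent_row = flatten_parent_first(key, value, out)
            # Infer the foreign key from the parent's primary key.
            row[f"id_{key}"] = parent_row.get("id")
        else:
            row[key] = value
    out.append((entity, row))
    return row

statements: list = []
payload = {"name": "ds1", "project": {"id": 7, "name": "proj"}}
flatten_parent_first("dataset", payload, statements)
# Parent comes first; the child follows with its foreign key populated:
# [('project', {'id': 7, 'name': 'proj'}),
#  ('dataset', {'name': 'ds1', 'id_project': 7})]
```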


Relationship specifics
~~~~~~~~~~~~~~~~~~~~~~

Statement building leverages table ``relationship`` definitions, taking orientation into account.

In particular, if a relationship is one-armed (pointing in one direction only), it will not
be possible to create a nested resource in the other direction.


Schemas
-------

``Marshmallow`` also comes with some limitations, such as not being able to infer foreign key
population with respect to nested entities while de-serializing.

**E.g.** Given the following matching table and schema:

.. code:: python

   class Dataset(Base):
       id: Mapped[int] = mapped_column(Integer(), primary_key=True)
       ...
       id_project: Mapped[int] = mapped_column(ForeignKey("PROJECT.id"), nullable=False)
       project: Mapped["Project"] = relationship(back_populates="datasets")

   class DatasetSchema(Schema):
       id = Integer()
       ...
       id_project = Integer()
       project = Nested('ProjectSchema')
If you ``POST`` a new ``/datasets`` resource definition with a nested project, then upon
validation ``id_project`` will not be populated, which ultimately is your
``NOT NULL FOREIGN KEY`` field. Hence the SQL insert statement will raise integrity errors.

``Marshmallow`` offers built-ins in the form of decorators that let you tag functions
attached to the ``Schema``, such as ``@pre_load``, a hook called before validation
that lets you manually populate data if you detect it present in the dict.

This technique has two major disadvantages:

* It is quite cumbersome and error prone for the developer: for each relationship you may
  have to set foreign keys on either side, which means as many conditions checking what is
  present in your input dict.

* It cannot take into account generated keys. In our example, we may be creating the
  project as well. Hence it will not have an id yet, which would raise a ``ValidationError``
  for the dataset if we set the ``required=True`` flag on ``id_project``.


To bypass those limitations, ``BioDM`` validates incoming data using ``Marshmallow``'s
``partial=True`` flag, meaning that ``required`` keywords on fields are ignored and fields may be
skipped altogether. The validation step checks the overall structure and types of fields.

This yields a (list of) dictionary (of nested dictionaries) that is sent down to a ``Service``
for statement building and insertion. The Core will use knowledge of Table relationships to infer
this foreign key population and raise appropriate errors in case of truly incomplete input data.

This ultimately allows for more flexibility on input, such as sending a mix of create/update of
resources via ``POST``.
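The effect of ``partial=True`` can be sketched with a toy validator that checks structure and types but tolerates missing fields; the field map below is purely illustrative, not Marshmallow's implementation:

```python
# Toy "partial" validation: unknown fields and type mismatches are errors,
# but absent fields (even conceptually required ones) are tolerated.
FIELDS = {"id": int, "name": str, "id_project": int}

def validate_partial(data: dict) -> dict:
    errors = {}
    for key, value in data.items():
        expected = FIELDS.get(key)
        if expected is None:
            errors[key] = "unknown field"
        elif not isinstance(value, expected):
            errors[key] = f"expected {expected.__name__}"
    if errors:
        raise ValueError(errors)
    # 'id_project' may be absent: the Core infers it from the nested parent.
    return data

validate_partial({"name": "ds1"})  # passes despite missing keys
```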


Nested flags policy
~~~~~~~~~~~~~~~~~~~

Serialization follows ``Nested`` fields down. In particular, that means it is important to
limit the depth of data that is fetched, as it is easy to end up in infinite loops in case of
circular dependencies.

**E.g.**

.. code:: python

   class GroupSchema(Schema):
       """Schema for Keycloak Groups. id field is purposefully left out as we manage it internally."""
       path = String(metadata={"description": "Group name chain separated by '__'"})
       ...
       users = List(Nested('UserSchema', exclude=['groups']))
       children = List(Nested('GroupSchema', exclude=['children', 'parent']))
       parent = Nested('GroupSchema', exclude=['children', 'parent'])
In the example above, without those ``exclude`` flags pruning references to nested Groups further
down, serialization would go into infinite recursion.
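A tiny depth-guarded dump (purely illustrative, not Marshmallow's machinery) shows why the back-references must be pruned:

```python
# Illustrative sketch: dumping a circular Group structure terminates only
# because nested levels drop the 'children'/'parent' back-references.

def dump(obj: dict, exclude: tuple = ()) -> dict:
    out = {}
    for key, value in obj.items():
        if key in exclude:
            continue
        if isinstance(value, dict):
            # Nested group: exclude back-references, as the Schema above does.
            out[key] = dump(value, exclude=("children", "parent"))
        else:
            out[key] = value
    return out

root = {"path": "root", "parent": None}
child = {"path": "root__child", "parent": root}
root["children"] = child  # circular reference: root <-> child

dumped = dump(root)
# {'path': 'root', 'parent': None, 'children': {'path': 'root__child'}}
```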

Marshmallow provides other primitives, ``only``, ``load_only`` and ``dump_only``, that can also be
used to apply this restriction.


.. warning::

   It is important to make sure that your dumping configuration does not impede a Schema's
   ability to load the essential fields for creating a new resource.


For most cases, you may simply set fields identical to the matching Table, using Marshmallow
syntax. Furthermore, Schemas are the "i/o surface" of your app. This is where you decide what gets
loaded and dumped for a specific resource.

.. note::

   Setting ``metadata.description``, like for ``path`` in our example above, is used for
   automatic apispec docstring generation.


TODO: Coming up
91 changes: 84 additions & 7 deletions docs/user_manual.rst
@@ -209,16 +209,28 @@ there through `boto3 presigned-urls <https://boto3.amazonaws.com/v1/documentatio

* Upload

On creating a file, the resource will contain a field named ``upload_form`` that is a presigned
PUT request dictionary that you may use to perform direct upload.
On creating a new ``/file`` resource, it is required that you pass in the size in ``bytes``,
which you can obtain from its descriptor.
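For instance, in ``python`` the size can be read from the file's stat info before building the creation payload (a minimal sketch; the ``filename``/``size`` field names are assumptions, check your instance's schema):

```python
import os
import tempfile

# Write a throwaway file standing in for the one to upload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")  # 11 bytes of content
    path = tmp.name

size = os.path.getsize(path)  # size in bytes, from the file's stat info
# Hypothetical creation payload for the /file resource.
payload = {"filename": os.path.basename(path), "size": size}
os.remove(path)
```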

The following snippet lets you upload via script:
The resource shall contain a nested dictionary called ``upload``, composed of ``parts``
containing presigned forms for direct file upload.

Next we distinguish two cases:

Small files
~~~~~~~~~~~

In the case of a `small` file, i.e. less than `100MB`, there is a single ``part`` containing a
presigned ``POST``, and you may simply use the ``form`` to perform the upload.

The following snippet demonstrates how to do this in `python`:

.. code-block:: python
   :caption: upload_small_file.py

   import requests

   # obtained from file['upload']['parts'][0]['form'] creation response
   post = {'url': ..., 'fields': ...}
   file_path = "/path/to/my_file.ext"

@@ -232,19 +244,84 @@ The following snippet lets you upload via script:

       files=files,
       verify=True,
       allow_redirects=True)
   assert http_response.status_code == 201
Upon completion, BioDM will be notified back via a callback, so the file is immediately available.


Large files
~~~~~~~~~~~

For large files, several parts will be present, each allowing you to upload a chunk of
`size=100MB`, possibly less for the last one.

For each part successfully uploaded, the bucket will return an ``ETag`` that you have to
keep track of and associate with the correct ``part_number``.

Ultimately, the process has to be completed by submitting that mapping in order for the bucket
to aggregate all chunks into a file stored on the bucket. The bucket does not support passing a
callback for a ``part_upload``.

Similarly, here is an example using ``python``:

.. code-block:: python
   :caption: upload_large_file.py

   import json

   import requests

   CHUNK_SIZE = 100*1024**2  # 100MB

   parts_etags = []
   host: str = ...  # Server instance endpoint
   file_id = ...  # obtained from file['id']
   upload_forms = [{'part_number': 1, 'form': ...}, ...]  # obtained from file['upload']['parts']

   # Upload file
   with open(big_file_path, 'rb') as file:
       for part in upload_forms:
           part_data = file.read(CHUNK_SIZE)  # Fetch one chunk.
           response = requests.put(
               part['form'], data=part_data, headers={'Content-Encoding': 'gzip'}
           )
           assert response.status_code == 200

           # Get etag and remove trailing quotes to not disturb subsequent (json) loading.
           etag = response.headers.get('ETag', "").replace('"', '')
           # Build mapping.
           parts_etags.append({'PartNumber': part['part_number'], 'ETag': etag})

   # Send completion notice with the mapping.
   complete = requests.put(
       f"{host}/files/{file_id}/complete_multipart",
       data=json.dumps(parts_etags).encode('utf-8')
   )
   assert complete.status_code == 201
   assert 'Completed.' in complete.text
.. note::

   The example above is a rather naive approach. For very large files, you should make use of a
   concurrency library (such as ``concurrent.futures`` or ``multiprocessing`` in ``python``) in
   order to speed up that process, as parts can be uploaded in any order.
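As a sketch of that concurrent variant, the presigned ``PUT`` is stubbed out below so only the chunking and ordering logic remain; ``upload_part`` is a hypothetical stand-in for the real request:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_part(part_number: int, data: bytes) -> dict:
    # Stub standing in for the presigned PUT; a real implementation would
    # perform the request and return the bucket's ETag response header.
    return {"PartNumber": part_number, "ETag": f"etag-{part_number}"}

def upload_concurrently(chunks: list) -> list:
    # Parts may finish in any order; sort the mapping on part number
    # before submitting the completion notice.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(upload_part, number, data)
            for number, data in enumerate(chunks, start=1)
        ]
        etags = [future.result() for future in futures]
    return sorted(etags, key=lambda entry: entry["PartNumber"])

parts_etags = upload_concurrently([b"a" * 10, b"b" * 10, b"c" * 5])
# The resulting mapping is ordered by part number, ready for completion.
```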

* Download

Calling ``GET /my_file_resources`` will only return associated metadata
Calling ``GET /my_file_resources`` will only return associated metadata (and the upload form(s)
while the file is still in pending state).

To download a file, use the following endpoint.

.. code-block:: bash

   curl ${SERVER_ENDPOINT}/my_file_resources/{id}/download
That will return a url to directly download the file via GET request.
That will return a URL to directly download the file via a ``GET`` request.

.. note::

   Download urls come back with a redirect header, thus you may use the
   ``allow_redirects=True`` flag or equivalent when visiting this route to download in one go.


User permissions
2 changes: 1 addition & 1 deletion src/biodm/__init__.py
@@ -1,5 +1,5 @@
"""BioDM framework."""
__version__ = '0.5.3'
__version__ = '0.6.0'
__version_info__ = ([int(num) for num in __version__.split('.')])


19 changes: 9 additions & 10 deletions src/biodm/api.py
@@ -28,7 +28,7 @@
from biodm.exceptions import RequestError
from biodm.utils.security import UserInfo
from biodm.utils.utils import to_it
from biodm.tables import History, ListGroup
from biodm.tables import History, ListGroup, Upload, UploadPart
from biodm import __version__ as CORE_VERSION


@@ -132,7 +132,8 @@ def __init__(

## Controllers.
classes = CORE_CONTROLLERS + (controllers or [])
classes.append(K8sController)
if hasattr(self, 'k8'):
classes.append(K8sController)
routes = self.adopt_controllers(classes)

## Schema Generator.
@@ -146,28 +147,26 @@ def __init__(
security=[{'Authorization': []}] # Same name as security_scheme arg below.
)
)

token = {
self.apispec.spec.components.security_scheme("Authorization", {
"type": "http",
"name": "authorization",
"in": "header",
"scheme": "bearer",
"bearerFormat": "JWT"
}

self.apispec.spec.components.security_scheme("Authorization", token)
})

"""Headless Services
For entities that are managed internally: not exposing routes.
i.e. only ListGroups and History atm
Normally the controller instantiates the service, and it does so
because the service needs to access the app instance.
If more useful cases for this show up we might want to design a cleaner solution.
"""
History.svc = UnaryEntityService(app=self, table=History)
ListGroup.svc = CompositeEntityService(app=self, table=ListGroup)
History.svc = UnaryEntityService(app=self, table=History)
UploadPart.svc = UnaryEntityService(app=self, table=UploadPart)
ListGroup.svc = CompositeEntityService(app=self, table=ListGroup)
Upload.svc = CompositeEntityService(app=self, table=Upload)

super().__init__(debug, routes, *args, **kwargs)

