Merge pull request #22 from bag-cnag/dev
Release 0.6
Neah-Ko authored Aug 8, 2024
2 parents 50bdc91 + 70a2632 commit 9b01b7f
Showing 43 changed files with 1,133 additions and 596 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -41,7 +41,7 @@ jobs:
- name: Run unit tests
run: |
cd src/tests/unit
pytest
python -m pytest
cd -
- name: Analysing the code with pylint
2 changes: 1 addition & 1 deletion docker/Dockerfile.biodm-test-runner
@@ -16,4 +16,4 @@ COPY ${TEST_DIR_PATH} /tests

WORKDIR /tests

ENTRYPOINT ["pytest"]
ENTRYPOINT ["python", "-m", "pytest"]
3 changes: 0 additions & 3 deletions docs/developer_manual/doc_endpoints.rst
@@ -37,9 +37,6 @@ decorators.
responses:
201:
description: Write Dataset resource.
examples: |
# TODO:
{"name": "instant_sc_1234", ""}
content:
application/json:
schema: DatasetSchema
16 changes: 16 additions & 0 deletions docs/developer_manual/permissions.rst
@@ -97,3 +97,19 @@ with a nested collection and its elements.

Those permissions will be taken into account when directly accessing ``/files`` API routes.


Strict composition
~~~~~~~~~~~~~~~~~~

Currently, ``BioDM`` assumes a strict composition pattern of resources for those permissions.
This allows them to be taken into account when directly accessing child resource routes,
as mentioned above.

Unfortunately, that also means that distributing permissions from two or more parent-level
resources is currently not tested and will most likely result in soft-locking those resources.

This may or may not be supported in a future version of the Core, depending on technical
feasibility.

If you wish to achieve something in that vein, it is for now advised to create an identical
resource under a different name.
22 changes: 15 additions & 7 deletions docs/developer_manual/s3conf.rst
@@ -19,17 +19,25 @@ The following variables have to be provided.
File management
----------------
To ensure bucket key uniqueness for uploaded files, the key gets prefixed by
``S3File.key_salt`` column. By default this is an ``uuid4`` but in case you would like to
manage this differently you could override this attribute in ``File`` class.
``S3File.key_salt`` column. By default this is a ``uuid4``.

In case you would like precise control over how your files are named on the bucket, this
can be done by overloading ``key_salt`` with a ``hybrid_property`` in the following way.

.. code-block:: python
   :caption: demo.py

   from sqlalchemy.ext.hybrid import hybrid_property

   class File(bd.components.S3File, bd.components.Base):
       ...
       @declared_attr
       @classmethod
       def key_salt(cls) -> Mapped[str]:
           # Replace lambda below by a personalized function.
           return Column(String(8), nullable=False, default=lambda: "myprefix")

       @hybrid_property
       async def key_salt(self) -> str:
           # Pop session, populated by S3Service just before asking for that attr.
           session = self.__dict__.pop('session')
           # Use session to fetch what you need.
           await session.refresh(self, ['dataset'])
           await session.refresh(self.dataset, ['project'])
           # Build your custom prefix.
           return f"{self.dataset.project.name}_{self.dataset.name}"
128 changes: 125 additions & 3 deletions docs/developer_manual/table_schema.rst
@@ -1,6 +1,128 @@
Designing Table and Schemas
Tables and Schemas
============================

This section provides some recommended practices in order to design your tables and schemas.
This section describes how ``BioDM`` leverages your ``Tables`` and ``Schemas`` in order to set
up resources. It contains some useful information for developers designing their own.

It is recommended to visit the documentation of both ``SQLAlchemy`` and ``Marshmallow`` beforehand.
Moreover, our ``example`` project also provides plenty of inspiration for this task.

Tables
------

In principle any valid ``SQLAlchemy`` table is accepted by ``BioDM``. Yet,
depending on how you configure it, it will adopt varying behaviours.

The SQLAlchemy 2.0 ORM API is a really nice and convenient piece of technology.
However, it does not natively support trees of entities (nested dictionaries).

To mitigate this problem, ``BioDM`` does a pass of parsing on schema validation results under the
hood, using Tables as literals in order to build the hierarchical tree of statements.
In a second pass, it inserts the whole tree in order, ensuring integrity.
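The two passes can be illustrated with a minimal, framework-agnostic sketch. Everything below (the function name, payload shape, and the ``id_``-prefix convention) is a hypothetical illustration, not BioDM's actual internals:

```python
# Illustrative sketch: separate a nested payload into parent-first
# "statements" so parents are inserted before children referencing them.

def flatten_parent_first(entity: str, data: dict, out: list) -> dict:
    """Recurse into nested dicts (parents) before appending the current row."""
    row = {}
    for key, value in data.items():
        if isinstance(value, dict):
            parent_row = flatten_parent_first(key, value, out)
            # Infer the foreign key from the parent's primary key.
            row[f"id_{key}"] = parent_row.get("id")
        else:
            row[key] = value
    out.append((entity, row))
    return row

statements: list = []
payload = {"name": "ds1", "project": {"id": 7, "name": "proj"}}
flatten_parent_first("dataset", payload, statements)
# Parent comes first; the child follows with its foreign key populated:
# [('project', {'id': 7, 'name': 'proj'}),
#  ('dataset', {'name': 'ds1', 'id_project': 7})]
```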


Relationship specifics
~~~~~~~~~~~~~~~~~~~~~~

Statement building leverages table ``relationship`` definitions, taking orientation into account.

In particular, if a relationship is one-armed (pointing in one direction only), it will not
be possible to create a nested resource in the other direction.


Schemas
-------

``Marshmallow`` also comes with some limitations, such as not being able to infer foreign key
population with respect to nested entities while de-serializing.

**E.g.** Given the following matching table and schema:

.. code:: python

   class Dataset(Base):
       id: Mapped[int] = mapped_column(Integer(), primary_key=True)
       ...
       id_project: Mapped[int] = mapped_column(ForeignKey("PROJECT.id"), nullable=False)
       project: Mapped["Project"] = relationship(back_populates="datasets")

   class DatasetSchema(Schema):
       id = Integer()
       ...
       id_project = Integer()
       project = Nested('ProjectSchema')
If you ``POST`` a new ``/datasets`` resource definition with a nested project, then upon
validation ``id_project`` will not be populated, which ultimately is your
``NOT NULL FOREIGN KEY`` field. Hence the SQL insert statement will raise integrity errors.

``Marshmallow`` offers built-ins in the form of decorators that let you tag functions
attached to the ``Schema``, such as ``@pre_load``, a hook called before validation
that lets you manually populate data if you detect it present in the dict.

This technique has two major disadvantages:

* It is quite cumbersome and error prone for the developer: for each relationship you may
  have to set foreign keys on either side, which means as many conditions checking what is
  present in your input dict.

* It cannot take into account generated keys. In our example, we may be creating the
  project as well. Hence it will not have an id yet, which would raise a ``ValidationError``
  for the dataset if we set the ``required=True`` flag on ``id_project``.


To bypass those limitations, ``BioDM`` validates incoming data using ``Marshmallow``'s
``partial=True`` flag, meaning that ``required`` keywords on fields are ignored and fields may be
skipped altogether. The validation step checks the overall structure and types of fields.

This yields a (list of) dictionary (of nested dictionaries) that is sent down to a ``Service``
for statement building and insertion. The Core will use knowledge of Table relationships to infer
this foreign key population and raise appropriate errors in case of truly incomplete input data.

This ultimately allows for more flexibility on input, such as sending a mix of create/update of
resources via ``POST``.
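The effect of ``partial=True`` can be sketched with a toy validator that checks structure and types but tolerates missing fields; the field map below is purely illustrative, not Marshmallow's implementation:

```python
# Toy "partial" validation: unknown fields and type mismatches are errors,
# but absent fields (even conceptually required ones) are tolerated.
FIELDS = {"id": int, "name": str, "id_project": int}

def validate_partial(data: dict) -> dict:
    errors = {}
    for key, value in data.items():
        expected = FIELDS.get(key)
        if expected is None:
            errors[key] = "unknown field"
        elif not isinstance(value, expected):
            errors[key] = f"expected {expected.__name__}"
    if errors:
        raise ValueError(errors)
    # 'id_project' may be absent: the Core infers it from the nested parent.
    return data

validate_partial({"name": "ds1"})  # passes despite missing keys
```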


Nested flags policy
~~~~~~~~~~~~~~~~~~~

Serialization follows ``Nested`` fields down. In particular, that means it is important to
limit the depth of data that is fetched, as it is easy to end up in infinite loops in case of
circular dependencies.

**E.g.**

.. code:: python

   class GroupSchema(Schema):
       """Schema for Keycloak Groups. id field is purposefully left out as we manage it internally."""
       path = String(metadata={"description": "Group name chain separated by '__'"})
       ...
       users = List(Nested('UserSchema', exclude=['groups']))
       children = List(Nested('GroupSchema', exclude=['children', 'parent']))
       parent = Nested('GroupSchema', exclude=['children', 'parent'])
In the example above, without those ``exclude`` flags pruning references to nested Groups further
down, serialization would go into infinite recursion.
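A tiny depth-guarded dump (purely illustrative, not Marshmallow's machinery) shows why the back-references must be pruned:

```python
# Illustrative sketch: dumping a circular Group structure terminates only
# because nested levels drop the 'children'/'parent' back-references.

def dump(obj: dict, exclude: tuple = ()) -> dict:
    out = {}
    for key, value in obj.items():
        if key in exclude:
            continue
        if isinstance(value, dict):
            # Nested group: exclude back-references, as the Schema above does.
            out[key] = dump(value, exclude=("children", "parent"))
        else:
            out[key] = value
    return out

root = {"path": "root", "parent": None}
child = {"path": "root__child", "parent": root}
root["children"] = child  # circular reference: root <-> child

dumped = dump(root)
# {'path': 'root', 'parent': None, 'children': {'path': 'root__child'}}
```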

Marshmallow provides other primitives, ``only``, ``load_only`` and ``dump_only``, that can also be
used to apply this restriction.


.. warning::

   It is important to make sure that your dumping configuration does not impede a Schema's
   ability to load the essential fields for creating a new resource.


For most cases, you may simply set fields identical to the matching Table, using Marshmallow
syntax. Furthermore, Schemas are the "i/o surface" of your app. This is where you decide what gets
loaded and dumped for a specific resource.

.. note::

   Setting ``metadata.description``, like for ``path`` in our example above, is used for
   automatic apispec docstring generation.


TODO: Coming up
91 changes: 84 additions & 7 deletions docs/user_manual.rst
@@ -209,16 +209,28 @@ there through `boto3 presigned-urls <https://boto3.amazonaws.com/v1/documentatio

* Upload

On creating a file, the resource will contain a field named ``upload_form`` that is a presigned
PUT request dictionary that you may use to perform direct upload.
On creating a new ``/file`` resource, it is required that you pass in the size in ``bytes``,
which you can obtain from its descriptor.
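For instance, in ``python`` the size can be read from the file's stat info before building the creation payload (a minimal sketch; the ``filename``/``size`` field names are assumptions, check your instance's schema):

```python
import os
import tempfile

# Write a throwaway file standing in for the one to upload.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")  # 11 bytes of content
    path = tmp.name

size = os.path.getsize(path)  # size in bytes, from the file's stat info
# Hypothetical creation payload for the /file resource.
payload = {"filename": os.path.basename(path), "size": size}
os.remove(path)
```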

The following snippet lets you upload via script:
The resource shall contain a nested dictionary called ``upload``, composed of ``parts``
containing presigned forms for direct file upload.

Next we distinguish two cases:

Small files
~~~~~~~~~~~

In the case of a `small` file, i.e. less than `100MB`, there is a single ``part`` containing a
presigned ``POST``, and you may simply use the ``form`` to perform the upload.

The following snippet demonstrates how to do this in `python`:

.. code-block:: python
   :caption: upload_small_file.py

   import requests

   # obtained from file['upload']['parts'][0]['form'] creation response
   post = {'url': ..., 'fields': ...}
   file_path = "/path/to/my_file.ext"

@@ -232,19 +244,84 @@ The following snippet lets you upload via script:

       files=files,
       verify=True,
       allow_redirects=True)
   assert http_response.status_code == 201
Upon completion, BioDM will be notified back via a callback, so the file is immediately available.


Large files
~~~~~~~~~~~

For large files, several parts will be present, each allowing you to upload a chunk of
`size=100MB`, possibly less for the last one.

For each part successfully uploaded, the bucket will return an ``ETag`` that you have to
keep track of and associate with the correct ``part_number``.

Ultimately, the process has to be completed by submitting that mapping in order for the bucket
to aggregate all chunks into a file stored on the bucket. The bucket does not support passing a
callback for a ``part_upload``.

Similarly, here is an example using ``python``:

.. code-block:: python
   :caption: upload_large_file.py

   import json

   import requests

   CHUNK_SIZE = 100*1024**2  # 100MB

   parts_etags = []
   host: str = ...  # Server instance endpoint
   file_id = ...  # obtained from file['id']
   upload_forms = [{'part_number': 1, 'form': ...}, ...]  # obtained from file['upload']['parts']

   # Upload file
   with open(big_file_path, 'rb') as file:
       for part in upload_forms:
           part_data = file.read(CHUNK_SIZE)  # Fetch one chunk.
           response = requests.put(
               part['form'], data=part_data, headers={'Content-Encoding': 'gzip'}
           )
           assert response.status_code == 200

           # Get etag and remove trailing quotes to not disturb subsequent (json) loading.
           etag = response.headers.get('ETag', "").replace('"', '')
           # Build mapping.
           parts_etags.append({'PartNumber': part['part_number'], 'ETag': etag})

   # Send completion notice with the mapping.
   complete = requests.put(
       f"{host}/files/{file_id}/complete_multipart",
       data=json.dumps(parts_etags).encode('utf-8')
   )
   assert complete.status_code == 201
   assert 'Completed.' in complete.text
.. note::

   The example above is a rather naive approach. For very large files, you should make use of a
   concurrency library (such as ``concurrent.futures`` or ``multiprocessing`` in ``python``) in
   order to speed up that process, as parts can be uploaded in any order.
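As a sketch of that concurrent variant, the presigned ``PUT`` is stubbed out below so only the chunking and ordering logic remain; ``upload_part`` is a hypothetical stand-in for the real request:

```python
from concurrent.futures import ThreadPoolExecutor

def upload_part(part_number: int, data: bytes) -> dict:
    # Stub standing in for the presigned PUT; a real implementation would
    # perform the request and return the bucket's ETag response header.
    return {"PartNumber": part_number, "ETag": f"etag-{part_number}"}

def upload_concurrently(chunks: list) -> list:
    # Parts may finish in any order; sort the mapping on part number
    # before submitting the completion notice.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(upload_part, number, data)
            for number, data in enumerate(chunks, start=1)
        ]
        etags = [future.result() for future in futures]
    return sorted(etags, key=lambda entry: entry["PartNumber"])

parts_etags = upload_concurrently([b"a" * 10, b"b" * 10, b"c" * 5])
# The resulting mapping is ordered by part number, ready for completion.
```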

* Download

Calling ``GET /my_file_resources`` will only return associated metadata
Calling ``GET /my_file_resources`` will only return associated metadata (and the upload form(s)
while the file is still in pending state).

To download a file, use the following endpoint.

.. code-block:: bash

   curl ${SERVER_ENDPOINT}/my_file_resources/{id}/download
That will return a url to directly download the file via GET request.
That will return a URL to directly download the file via a ``GET`` request.

.. note::

   Download urls come back with a redirect header, thus you may use the
   ``allow_redirects=True`` flag or equivalent when visiting this route to download in one go.


User permissions
2 changes: 1 addition & 1 deletion src/biodm/__init__.py
@@ -1,5 +1,5 @@
"""BioDM framework."""
__version__ = '0.5.3'
__version__ = '0.6.0'
__version_info__ = ([int(num) for num in __version__.split('.')])


19 changes: 9 additions & 10 deletions src/biodm/api.py
@@ -28,7 +28,7 @@
from biodm.exceptions import RequestError
from biodm.utils.security import UserInfo
from biodm.utils.utils import to_it
from biodm.tables import History, ListGroup
from biodm.tables import History, ListGroup, Upload, UploadPart
from biodm import __version__ as CORE_VERSION


@@ -132,7 +132,8 @@ def __init__(

## Controllers.
classes = CORE_CONTROLLERS + (controllers or [])
classes.append(K8sController)
if hasattr(self, 'k8'):
classes.append(K8sController)
routes = self.adopt_controllers(classes)

## Schema Generator.
@@ -146,28 +147,26 @@ def __init__(
security=[{'Authorization': []}] # Same name as security_scheme arg below.
)
)

token = {
self.apispec.spec.components.security_scheme("Authorization", {
"type": "http",
"name": "authorization",
"in": "header",
"scheme": "bearer",
"bearerFormat": "JWT"
}

self.apispec.spec.components.security_scheme("Authorization", token)
})

"""Headless Services
For entities that are managed internally: not exposing routes.
i.e. only ListGroups and History atm
Normally the controller instantiates the service, and it does so
because the service needs to access the app instance.
If more useful cases for this show up we might want to design a cleaner solution.
"""
History.svc = UnaryEntityService(app=self, table=History)
ListGroup.svc = CompositeEntityService(app=self, table=ListGroup)
History.svc = UnaryEntityService(app=self, table=History)
UploadPart.svc = UnaryEntityService(app=self, table=UploadPart)
ListGroup.svc = CompositeEntityService(app=self, table=ListGroup)
Upload.svc = CompositeEntityService(app=self, table=Upload)

super().__init__(debug, routes, *args, **kwargs)

