Added the Ingest Workflow Developer's Guide

lsst · Oct 9, 2024 · 48743b4 · 48743b4
1 parent ec926af
commit 48743b4

Showing 16 changed files with 3,640 additions and 55 deletions.
diff --git a/doc/ingest/api/index.rst b/doc/ingest/api/index.rst
@@ -1,20 +1,103 @@
+
+.. note::
+
+   Information in this guide corresponds to version **38** of the Qserv REST API. Keep in mind
+   that each implementation of the API has a specific version. The version number changes
+   whenever a change to the implementation or the API might affect users.
+   This document will be kept up to date to reflect the latest version of the API.
+
 #####################################
 The Ingest Workflow Developer's Guide
 #####################################
 
-TBC
+.. toctree::
+   :maxdepth: 4
+
+   reference/index
+
+Introduction
+============
+
+This document presents the API that Qserv provides for constructing data ingest applications (also referred to
+in the document as *ingest workflows*). The API is designed to provide a high-performance and reliable mechanism for
+ingesting large quantities of data where the performance or reliability of the ingests is at stake.
+The document is intended to be a practical guide for developers who are building those applications.
+It provides a high-level overview of the API, its main components, and the typical workflows that can be built using the API.
+
+At a very high level, the Qserv Ingest system consists of:
+
+- The REST server that is integrated into the Master Replication Controller. The server provides a collection
+  of services for managing metadata and states of the new catalogs to be ingested. The server also coordinates
+  its own operations with Qserv itself and the Qserv Replication System to prevent interference with those
+  and minimize failures during catalog ingest activities.
+- The Data Ingest services that run at each Qserv worker alongside the Replication System's worker services.
+  The role of these services is to actually ingest the client's data into the corresponding MySQL tables.
+  The services also perform additional (albeit minimal) preprocessing and data transformation (where or when needed)
+  before ingesting the input data into MySQL. Each worker server also includes its own REST server for processing
+  the "by reference" ingest requests as well as various metadata requests in the scope of the workers.
+
+Implementation-wise, the Ingest System relies heavily on services and functions of the Replication System, including
+the Replication System's Controller Framework, various services (including the Configuration service), and the worker-side
+server infrastructure of the Replication System.
+
+Client workflows interact with the system's services via open interfaces (based on the HTTP protocol, REST services,
+JSON data format, etc.) and use ready-to-use tools to fulfill their goals of ingesting catalogs.
+
+Here is a brief summary of the Qserv Ingest System's features:
+
+- It introduces well-defined states and semantics into the ingest process. With that, the process of ingesting a new catalog
+  now has to go through a sequence of specific steps, maintaining a progressive state of the catalog within Qserv
+  while it's being ingested. The state transitions and the corresponding enforcements made by the system
+  ensure that the catalog is in a consistent state during each step of the process.
+  Altogether, this model increases the robustness of the process, and it also makes it more efficient.
+
+- To facilitate and implement the above-mentioned state transitions, the new system introduces a distributed
+  *tagging* and *checkpointing* mechanism called *super-transactions*. The transactions allow for incremental
+  updates of the overall state of the data and metadata while making it possible to safely roll back to a prior consistent
+  state should any problem occur during data loading within such transactions.
+
+  - The data tagging capability of the transactions can also be used by the ingest workflows and by
+    the Qserv administrators for bookkeeping of the ingest activities and for quality control of
+    the ingested catalogs.
+
+- At its very foundation, the system has been designed for constructing high-performance and parallel ingest
+  workflows without compromising the consistency of the ingested catalogs.
+
+- For the actual data loading, the system offers a few options, including pushing data into Qserv directly
+  via a proprietary binary protocol, HTTP streaming, or ingesting contributions "by reference". In the latter
+  case, the input data (so-called *contributions*) will be pulled by the worker services from remote locations
+  as instructed by the ingest workflows. The presently supported sources include object stores (via the HTTP/HTTPS
+  protocols) and locally mounted distributed filesystems (via the POSIX protocol). (**TODO**: See Ingesting files directly from
+  workers for further details).
+
+- The data loading services also collect various information on the ongoing status of the ingest activities,
+  abnormal conditions that may occur during reading, interpreting, or loading the data into Qserv, as well
+  as the metadata for the data that is loaded. The information is retained within the persistent
+  state of the Replication/Ingest System for monitoring and debugging purposes. Feedback is provided
+  to the workflows on various aspects of the ingest activities. The feedback is useful for the workflows to adjust their
+  behavior and to ensure the quality of the data being ingested.
+
+  - For further information on this subject, see the sections Error reporting (**TODO**) and Using MySQL warnings (**TODO**).
+    In addition, the API provides REST services for obtaining metadata on the state of catalogs, tables, distributed
+    transactions, contribution requests, the progress of the requested operations, etc.
+
+**What the Ingest System does NOT do**:
+
+- As per its current implementation (which may change in the future), it does not automatically partition
+  input files. This task is expected to be the responsibility of the ingest workflows. The only data formats
+  presently supported for the table payload are ``CSV`` and ``JSON`` (primarily for ingesting
+  user-generated data products as explained in :ref:`http-frontend-ingest`).
 
+- It does not (with the exception of adding an extra leading column ``qserv_trans_id`` required by
+  the implementation of the previously mentioned *super-transactions*) pre-process the input ``CSV``
+  payload sent to the Ingest Data Servers by the workflows for loading into tables.
+  It's up to the workflows to sanitize the input data and to make it ready to be ingested into Qserv.
 
-.. list-table:: Title
-   :widths: 25 25 50
-   :header-rows: 1
+More information on the requirements and the low-level technical details of the system's implementation (beyond what
+is needed for the purposes of this document) can be found elsewhere.
 
-   * - Heading row 1, column 1
-     - Heading row 1, column 2
-     - Heading row 1, column 3
-   * - Row 1, column 1
-     -
-     - Row 1, column 3
-   * - Row 2, column 1
-     - Row 2, column 2
-     - Row 2, column 3
+It's recommended to read the document sequentially. Most ideas presented in the document are introduced in
+the section "An example of a simple workflow" (**TODO**: add a link to the section). The section is followed
+by a few more sections covering advanced topics (**TODO**: add a link to the section).
+The :ref:`ingest-api-reference` section at the very end of the document should be used to find complete
+descriptions of the REST services and tools mentioned in the document.
diff --git a/doc/ingest/api/reference/index.rst b/doc/ingest/api/reference/index.rst
@@ -0,0 +1,12 @@
+
+.. _ingest-api-reference:
+
+######################
+Ingest API Reference
+######################
+
+.. toctree::
+   :maxdepth: 4
+
+   rest/index
+   tools
diff --git a/doc/ingest/api/reference/rest/controller/config.rst b/doc/ingest/api/reference/rest/controller/config.rst
@@ -0,0 +1,253 @@
+.. _ingest-config:
+
+Configuring parameters of the ingests
+=====================================
+
+.. _ingest-config-set:
+
+Setting configuration parameters
+--------------------------------
+
+Parameters are set for a database (regardless of the *published* status) using the following service:
+
+..  list-table::
+    :widths: 10 90
+    :header-rows: 0
+
+    * - ``PUT``
+      - ``/ingest/config``
+
+The request object has the following schema:
+
+.. code-block::
+
+    {   "database" : <string>,
+        "SSL_VERIFYHOST" : <number>,
+        "SSL_VERIFYPEER" : <number>,
+        "CAPATH" : <string>,
+        "CAINFO" : <string>,
+        "CAINFO_VAL" : <string>,
+        "PROXY_SSL_VERIFYHOST" : <number>,
+        "PROXY_SSL_VERIFYPEER" : <number>,
+        "PROXY_CAPATH" : <string>,
+        "PROXY_CAINFO" : <string>,
+        "PROXY_CAINFO_VAL" : <string>,
+        "CURLOPT_PROXY" : <string>,
+        "CURLOPT_NOPROXY" : <string>,
+        "CURLOPT_HTTPPROXYTUNNEL" : <number>,
+        "CONNECTTIMEOUT" : <number>,
+        "TIMEOUT" : <number>,
+        "LOW_SPEED_LIMIT" : <number>,
+        "LOW_SPEED_TIME" : <number>,
+        "ASYNC_PROC_LIMIT" : <number>
+    }
+
+Where:
+
+``database`` : *string* : **required**
+
+- The name of a database affected by the operation.
+
+``SSL_VERIFYHOST`` : *number* : **optional** = ``2``
+
+- The flag that tells the system to verify the host of the peer. If the value is set
+  to ``0`` the system will not check the host name against the certificate. Any other value would tell the system
+  to perform the check.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html.
+
+``SSL_VERIFYPEER`` : *number* : **optional** = ``1``
+
+- The flag that tells the system to verify the peer's certificate. If the value is set
+  to ``0`` the system will not check the certificate. Any other value would tell the system to perform the check.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html.
+
+``CAPATH`` : *string* : **optional** = ``/etc/ssl/certs``
+
+- A path to a directory holding multiple CA certificates. The system will use the certificates
+  in the directory to verify the peer's certificate. If the value is set to an empty string the system will not use
+  the certificates.
+
+  Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been
+  configured for the database.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_CAPATH.html.
+
+``CAINFO`` : *string* : **optional** = ``/etc/ssl/certs/ca-certificates.crt``
+
+- A path to a file holding a bundle of CA certificates. The system will use the certificates
+  in the file to verify the peer's certificate. If the value is set to an empty string the system will not use
+  the certificates.
+
+  Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been
+  configured for the database.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_CAINFO.html.
+
+``CAINFO_VAL`` : *string* : **optional** = ``""``
+
+- A value of a certificate bundle for a peer. This parameter is used in those cases when it's
+  impossible to inject the bundle directly into the Ingest workers' environments. If a non-empty value of the parameter
+  is provided then ingest servers will use it instead of the one mentioned (if any) in the above-described
+  attribute ``CAINFO``.
+
+  **Attention**: Values of the attribute are the actual certificates, not file paths like in the case of ``CAINFO``.
+
+``PROXY_SSL_VERIFYHOST`` : *number* : **optional** = ``2``
+
+- The flag that tells the system to verify the host of the proxy. If the value is set
+  to ``0`` the system will not check the host name against the certificate. Any other value would tell the system
+  to perform the check.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_SSL_VERIFYHOST.html.
+
+``PROXY_SSL_VERIFYPEER`` : *number* : **optional** = ``1``
+
+- The flag that tells the system to verify the proxy's certificate. If the value is set
+  to ``0`` the system will not check the certificate. Any other value would tell the system to perform the check.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_SSL_VERIFYPEER.html.
+
+``PROXY_CAPATH`` : *string* : **optional** = ``""``
+
+- A path to a directory holding multiple CA certificates. The system will use the certificates
+  in the directory to verify the proxy's certificate. If the value is set to an empty string the system will not use
+  the certificates.
+
+  Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been
+  configured for the database.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_CAPATH.html.
+
+``PROXY_CAINFO`` : *string* : **optional** = ``""`` (Built-in system specific)
+
+- A path to a file holding a bundle of CA certificates. The system will use the certificates
+  in the file to verify the proxy's certificate. If the value is set to an empty string the system will not use
+  the certificates.
+
+  Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been
+  configured for the database.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_CAINFO.html.
+
+``PROXY_CAINFO_VAL`` : *string* : **optional** = ``""``
+
+- A value of a certificate bundle for a proxy. This parameter is used in those cases when it's
+  impossible to inject the bundle directly into the Ingest workers' environments. If a non-empty value of the parameter
+  is provided then ingest servers will use it instead of the one mentioned (if any) in the above-described
+  attribute ``PROXY_CAINFO``.
+
+  **Attention**: Values of the attribute are the actual certificates, not file paths like in the case of ``PROXY_CAINFO``.
+
+``CURLOPT_PROXY`` : *string* : **optional** = ``""``
+
+- Set the proxy to use for the upcoming request. The value should be a string
+  holding the host name or dotted numerical IP address. A numerical IPv6 address must be written within ``[brackets]``.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY.html.
+
+``CURLOPT_NOPROXY`` : *string* : **optional** = ``""``
+
+- A comma-separated list of host names that do not require a proxy
+  to be reached, even if one is specified.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_NOPROXY.html.
+
+``CURLOPT_HTTPPROXYTUNNEL`` : *number* : **optional** = ``0``
+
+- Set the tunnel parameter to ``1`` to tunnel all operations through the HTTP proxy
+  (set with ``CURLOPT_PROXY``).
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_HTTPPROXYTUNNEL.html.
+
+``CONNECTTIMEOUT`` : *number* : **optional** = ``0`` (never times out)
+
+- The maximum time in seconds that the system will wait for a connection to be established.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html.
+
+``TIMEOUT`` : *number* : **optional** = ``0``
+
+- The maximum time in seconds that the system will wait for a response from the server.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html.
+
+``LOW_SPEED_LIMIT`` : *number* : **optional** = ``0``
+
+- The transfer speed in bytes per second below which the system considers the transfer too slow and aborts it.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_LIMIT.html.
+
+``LOW_SPEED_TIME`` : *number* : **optional** = ``0``
+
+- The time in seconds that the transfer speed is allowed to stay below the limit
+  set by ``LOW_SPEED_LIMIT`` before the system aborts the transfer.
+
+  This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_TIME.html.
+
+``ASYNC_PROC_LIMIT`` : *number* : **optional** = ``0``
+
+- The maximum number of contributions that can be processed concurrently in the scope of
+  the database. The actual number of parallel requests may be further lowered by the hard limit specified by
+  the Replication System worker's configuration parameter (``worker``, ``num-async-loader-processing-threads``).
+  The parameter can be adjusted in real time as needed. It takes effect immediately. Putting ``0`` as a value of
+  the parameter will effectively turn this option off as if it has never been configured for the database.
+
+  **Note**: The parameter is available as of API version ``14``.
+
+If the request succeeds, it returns the standard JSON object with no additional data other than
+the standard completion status.
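
As an illustration of the request schema above, the following minimal sketch builds (without sending) a ``PUT`` request to ``/ingest/config`` using only the Python standard library. The base URL, the database name ``test_db``, and the chosen parameter values are placeholders invented for this example, and any authorization or version attributes that a particular deployment may require are not shown.

```python
import json
import urllib.request

# Hypothetical base URL of the REST server; replace it with the actual
# host and port of the Master Replication Controller in your deployment.
BASE_URL = "http://localhost:25081"


def build_set_config_request(database, **params):
    """Build (but do not send) a PUT /ingest/config request.

    Only the parameters passed by the caller are included in the body;
    omitted optional parameters are left to the service.
    """
    body = {"database": database}
    body.update(params)
    return urllib.request.Request(
        url=f"{BASE_URL}/ingest/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_set_config_request(
    "test_db",            # placeholder database name
    SSL_VERIFYPEER=1,     # verify the peer's certificate
    CONNECTTIMEOUT=30,    # seconds
    ASYNC_PROC_LIMIT=4,   # concurrency limit for contributions
)

# To submit the request, pass it to urllib.request.urlopen() and parse
# the JSON object returned by the service.
```

Building the request separately from sending it makes it easy to inspect or log the JSON body before it is submitted.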
+
+.. _ingest-config-get:
+
+Retrieving configuration parameters
+-----------------------------------
+
+.. warning::
+    As of version ``14`` of the API, the name of the database is required to be passed in the request's query instead of
+    passing it in the JSON body. The older implementation was wrong.
+
+
+..  list-table::
+    :widths: 10 25 65
+    :header-rows: 1
+
+    * - method
+      - service
+      - query parameters
+    * - ``GET``
+      - ``/ingest/config``
+      - ``database=<string>``
+
+Where the mandatory query parameter ``database`` specifies the name of a database affected by the operation.
+
+If the operation is successfully finished it returns an extended JSON object that has the following schema (in addition
+to the standard status and error reporting attributes):
+
+.. code-block::
+
+    {   "database" : <string>,
+        "SSL_VERIFYHOST" : <number>,
+        "SSL_VERIFYPEER" : <number>,
+        "CAPATH" : <string>,
+        "CAINFO" : <string>,
+        "CAINFO_VAL" : <string>,
+        "PROXY_SSL_VERIFYHOST" : <number>,
+        "PROXY_SSL_VERIFYPEER" : <number>,
+        "PROXY_CAPATH" : <string>,
+        "PROXY_CAINFO" : <string>,
+        "PROXY_CAINFO_VAL" : <string>,
+        "CURLOPT_PROXY" : <string>,
+        "CURLOPT_NOPROXY" : <string>,
+        "CURLOPT_HTTPPROXYTUNNEL" : <number>,
+        "CONNECTTIMEOUT" : <number>,
+        "TIMEOUT" : <number>,
+        "LOW_SPEED_LIMIT" : <number>,
+        "LOW_SPEED_TIME" : <number>,
+        "ASYNC_PROC_LIMIT" : <number>
+    }
+
+The attributes of the response object are the same as the ones described in the section :ref:`ingest-config-set`.
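
A minimal sketch of the retrieval request, assuming the same placeholder base URL and database name as before, passes the database name in the query string rather than in a JSON body, as required by the warning above. The request is only constructed here, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical base URL of the REST server; adjust it for your deployment.
BASE_URL = "http://localhost:25081"


def build_get_config_request(database):
    """Build (but do not send) a GET /ingest/config request.

    The database name is URL-encoded and passed as a query parameter.
    """
    query = urllib.parse.urlencode({"database": database})
    return urllib.request.Request(
        url=f"{BASE_URL}/ingest/config?{query}",
        method="GET",
    )


request = build_get_config_request("test_db")  # placeholder database name

# To submit the request, pass it to urllib.request.urlopen() and decode
# the returned JSON object to read the configuration attributes.
```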