-
Notifications
You must be signed in to change notification settings - Fork 17
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Added the Ingest Workflow Developer's Guide
- Loading branch information
1 parent
ec926af
commit 48743b4
Showing
16 changed files
with
3,640 additions
and
55 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,20 +1,103 @@ | ||
|
||
.. note:: | ||
|
||
Information in this guide corresponds to the version **38** of the Qserv REST API. Keep in mind | ||
that each implementation of the API has a specific version. The version number will change | ||
if any changes to the implementation or the API that might affect users will be made. | ||
The current document will be kept updated to reflect the latest version of the API. | ||
|
||
##################################### | ||
The Ingest Workflow Developer's Guide | ||
##################################### | ||
|
||
TBC | ||
.. toctree:: | ||
:maxdepth: 4 | ||
|
||
reference/index | ||
|
||
Introduction | ||
============ | ||
|
||
This document presents an API that is available in Qserv for constructing the data ingest applications (also mentioned | ||
in the document as *ingest workflows*). The API is designed to provide a high-performance and reliable mechanism for | ||
ingesting large quantities of data where the high performance or reliability of the ingests is at stake. | ||
The document is intended to be a practical guide for the developers who are building those applications. | ||
It provides a high-level overview of the API, its main components, and the typical workflows that can be built using the API. | ||
|
||
At the very high level, the Qserv Ingest system is comprised of: | ||
|
||
- The REST server that is integrated into the Master Replication Controller. The server provides a collection | ||
of services for managing metadata and states of the new catalogs to be ingested. The server also coordinates | ||
its own operations with Qserv itself and the Qserv Replication System to prevent interferences with those | ||
and minimize failures during catalog ingest activities. | ||
- The Data Ingest services run at each Qserv worker alongside the Replication System's worker services. | ||
The role of these services is to actually ingest the client's data into the corresponding MySQL tables. | ||
The services would also do an additional (albeit, minimal) preprocessing and data transformation (where or when needed) | ||
before ingesting the input data into MySQL. Each worker server also includes its own REST server for processing | ||
the "by reference" ingest requests as well as various metadata requests in the scope of the workers. | ||
|
||
Implementation-wise, the Ingest System heavily relies on services and functions of the Replication System including | ||
the Replication System's Controller Framework, various (including the Configuration) services, and the worker-side | ||
server infrastructure of the Replication System. | ||
|
||
Client workflows interact with the system's services via open interfaces (based on the HTTP protocol, REST services, | ||
JSON data format, etc.) and use ready-to-use tools to fulfill their goals of ingesting catalogs. | ||
|
||
Here is a brief summary of the Qserv Ingest System's features: | ||
|
||
- It introduces the well-defined states and semantics into the ingest process. With that, a process of ingesting a new catalog | ||
now has to go through a sequence of specific steps maintaining a progressive state of the catalog within Qserv | ||
while it's being ingested. The state transitions and the corresponding enforcements made by the system would | ||
always ensure that the catalog would be in a consistent state during each step of the process. | ||
Altogether, this model increases the robustness of the process, and it also makes it more efficient. | ||
|
||
- To facilitate and implement the above-mentioned state transitions the new system introduces a distributed | ||
*tagging* and *checkpointing* mechanism called *super-transactions*. The transactions allow for incremental | ||
updates of the overall state of the data and metadata while allowing to safely roll back to a prior consistent | ||
state should any problem occur during data loading within such transactions. | ||
|
||
- The data tagging capability of the transactions can be also used by the ingest workflows and by | ||
the Qserv administrators for bookkeeping of the ingest activities and for the quality control of | ||
the ingested catalogs. | ||
|
||
- In its very foundation, the system has been designed for constructing high-performance and parallel ingest | ||
workflows w/o compromising the consistency of the ingested catalogs. | ||
|
||
- For the actual data loading, the system offers a few options, inluding pushing data into Qserv directly | ||
via a proprietary binary protocol, HTTP streaming, or ingesting contributions "by reference". In the latter | ||
case, the input data (so called *contributions*) will be pulled by the worker services from remote locations | ||
as instructed by the ingest workflows. The presently supported sources include the object stores (via the HTTP/HTTPS | ||
protocols) and the locally mounted distributed filesystems (via the POSIX protocol). (**TODO**: See Ingesting files directly from | ||
workers for further details). | ||
|
||
- The data loading services also collect various information on the ongoing status of the ingest activities, | ||
abnormal conditions that may occur during reading, interpreting, or loading the data into Qserv, as well | ||
as the metadata for the data that is loaded. The information is retained within the persistent | ||
state of the Replication/Ingest System for the monitoring and debugging purposes. A feedback is provided | ||
to the workflows on various aspects of the ingest activities. The feedback is useful for the workflows to adjust their | ||
behavior and to ensure the quality of the data being ingested. | ||
|
||
- To get further info on this subject, see the sections Error reporting (**TODO**) and Using MySQL warnings (**TODO**). | ||
In addition, the API provides REST services for obtaining metadata on the state of catalogs, tables, distributed | ||
transactions, contribution requests, the progress of the requested operations, etc. | ||
|
||
**What the Ingest System does NOT do**: | ||
|
||
- As per its current implementation (which may change in the future) it does not automatically partition | ||
input files. This task is expected to be a responsibility of the ingest workflows. The only data format | ||
is is presently supported for the table payload are ``CSV`` and ``JSON`` (primarily for ingesting | ||
user-generated data products as explained in :ref:`http-frontend-ingest`). | ||
|
||
- It does not (with an exception of adding an extra leading column ``qserv_trans_id`` required by | ||
the implementation of the previously mentioned *super-transactions*) pre-process the input ``CSV`` | ||
payload sent to the Ingest Data Servers by the workflows for loading into tables. | ||
It's up to the workflows to sanitize the input data and to make them ready to be ingested into Qserv. | ||
|
||
.. list-table:: Title | ||
:widths: 25 25 50 | ||
:header-rows: 1 | ||
More information on the requirements and the low-level technical details of its implementation (unless it's | ||
needed for the purposes of this document's goals) can be found elsewhere. | ||
|
||
* - Heading row 1, column 1 | ||
- Heading row 1, column 2 | ||
- Heading row 1, column 3 | ||
* - Row 1, column 1 | ||
- | ||
- Row 1, column 3 | ||
* - Row 2, column 1 | ||
- Row 2, column 2 | ||
- Row 2, column 3 | ||
It's recommended to read the document sequentially. Most ideas presented in the document are introduced in | ||
a section "An example of a simple workflow" (**TODO**: add a link to the section). The section is followed | ||
by a few more sections covering advanced topics (**TODO**: add a link to the section). | ||
The :ref:`ingest-api-reference` section at the very end of the document should be used to find complete | ||
descriptions of the REST services and tools mentioned in the document. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
|
||
.. _ingest-api-reference: | ||
|
||
###################### | ||
Ingest API Reference | ||
###################### | ||
|
||
.. toctree:: | ||
:maxdepth: 4 | ||
|
||
rest/index | ||
tools |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,253 @@ | ||
.. _ingest-config: | ||
|
||
Configuring parameters of the ingests | ||
===================================== | ||
|
||
.. _ingest-config-set: | ||
|
||
Setting configuration parameters | ||
-------------------------------- | ||
|
||
Parameters are set for a database (regardless of the *published* status) using the following service: | ||
|
||
.. list-table:: | ||
:widths: 10 90 | ||
:header-rows: 0 | ||
|
||
* - ``PUT`` | ||
- ``/ingest/config`` | ||
|
||
The request object has the following schema: | ||
|
||
.. code-block:: | ||
{ "database" : <string>, | ||
"SSL_VERIFYHOST" : <number>, | ||
"SSL_VERIFYPEER" : <number>, | ||
"CAPATH" : <string>, | ||
"CAINFO" : <string>, | ||
"CAINFO_VAL" : <string>, | ||
"PROXY_SSL_VERIFYHOST" : <number>, | ||
"PROXY_SSL_VERIFYPEER" : <number>, | ||
"PROXY_CAPATH" : <string>, | ||
"PROXY_CAINFO" : <string>, | ||
"PROXY_CAINFO_VAL" : <string>, | ||
"CURLOPT_PROXY" : <string>, | ||
"CURLOPT_NOPROXY" : <string>, | ||
"CURLOPT_HTTPPROXYTUNNEL" : <number>, | ||
"CONNECTTIMEOUT" : <number>, | ||
"TIMEOUT" : <number>, | ||
"LOW_SPEED_LIMIT" : <number>, | ||
"LOW_SPEED_TIME" : <number>, | ||
"ASYNC_PROC_LIMIT" : <number> | ||
} | ||
Where: | ||
|
||
``database`` : *string* : **required** | ||
|
||
- The name of a database affected by the operation. | ||
|
||
``SSL_VERIFYHOST`` : *number* : **optional** = ``2`` | ||
|
||
- The flag that tells the system to verify the host of the peer. If the value is set | ||
to ``0`` the system will not check the host name against the certificate. Any other value would tell the system | ||
to perform the check. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYHOST.html. | ||
|
||
``SSL_VERIFYPEER`` : *number* : **optional** = ``1`` | ||
|
||
- The flag that tells the system to verify the peer's certificate. If the value is set | ||
to ``0`` the system will not check the certificate. Any other value would tell the system to perform the check. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html. | ||
|
||
``CAPATH`` : *string* : **optional** = ``/etc/ssl/certs`` | ||
|
||
- A path to a directory holding multiple CA certificates. The system will use the certificates | ||
in the directory to verify the peer's certificate. If the value is set to an empty string the system will not use | ||
the certificates. | ||
|
||
Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been | ||
configured for the database. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_CAPATH.html. | ||
|
||
``CAINFO`` : *string* : **optional** = ``/etc/ssl/certs/ca-certificates.crt`` | ||
|
||
- A path to a file holding a bundle of CA certificates. The system will use the certificates | ||
in the file to verify the peer's certificate. If the value is set to an empty string the system will not use | ||
the certificates. | ||
|
||
Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been | ||
configured for the database. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_CAINFO.html. | ||
|
||
``CAINFO_VAL`` : *string* : **optional** = ``""`` | ||
|
||
- A value of a certificate bundle for a peer. This parameter is used in those cases when it's | ||
impossible to inject the bundle directly into the Ingest workers' environments. If a non-empty value of the parameter | ||
is provided then ingest servers will use it instead of the one mentioned (if any) in the above-described | ||
attribute ``CAINFO``. | ||
|
||
**Attention**: Values of the attribute are the actual certificates, not file paths like in the case of ``CAINFO``. | ||
|
||
``PROXY_SSL_VERIFYHOST`` : *number* : **optional** = ``2`` | ||
|
||
- The flag that tells the system to verify the host of the proxy. If the value is set | ||
to ``0`` the system will not check the host name against the certificate. Any other value would tell the system | ||
to perform the check. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_SSL_VERIFYHOST.html. | ||
|
||
``PROXY_SSL_VERIFYPEER`` : *number* : **optional** = ``1`` | ||
|
||
- The flag that tells the system to verify the peer's certificate. If the value is set | ||
to ``0`` the system will not check the certificate. Any other value would tell the system to perform the check. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_SSL_VERIFYPEER.html. | ||
|
||
``PROXY_CAPATH`` : *string* : **optional** = ``""`` | ||
|
||
- A path to a directory holding multiple CA certificates. The system will use the certificates | ||
in the directory to verify the peer's certificate. If the value is set to an empty string the system will not use | ||
the certificates. | ||
|
||
Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been | ||
configured for the database. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_CAPATH.html. | ||
|
||
``PROXY_CAINFO`` : *string* : **optional** = ``""`` (Built-in system specific) | ||
|
||
- A path to a file holding a bundle of CA certificates. The system will use the certificates | ||
in the file to verify the peer's certificate. If the value is set to an empty string the system will not use | ||
the certificates. | ||
|
||
Putting the empty string as a value of the parameter will effectively turn this option off as if it has never been | ||
configured for the database. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY_CAINFO.html. | ||
|
||
``PROXY_CAINFO_VAL`` : *string* : **optional** = ``""`` | ||
|
||
- A value of a certificate bundle for a proxy. This parameter is used in those cases when it's | ||
impossible to inject the bundle directly into the Ingest workers' environments. If a non-empty value of the parameter | ||
is provided then ingest servers will use it instead of the one mentioned (if any) in the above-described | ||
attribute ``PROXY_CAINFO``. | ||
|
||
**Attention**: Values of the attribute are the actual certificates, not file paths like in the case of ``PROXY_CAINFO``. | ||
|
||
``CURLOPT_PROXY`` : *string* : **optional** = ``""`` | ||
|
||
- Set the proxy to use for the upcoming request. The parameter should be a null-terminated string | ||
holding the host name or dotted numerical IP address. A numerical IPv6 address must be written within ``[brackets]``. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_PROXY.html. | ||
|
||
``CURLOPT_NOPROXY`` : *string* : **optional** = ``""`` | ||
|
||
- The string consists of a comma-separated list of host names that do not require a proxy | ||
to get reached, even if one is specified. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_NOPROXY.html. | ||
|
||
``CURLOPT_HTTPPROXYTUNNEL`` : *number* : **optional** = ``0`` | ||
|
||
- Set the tunnel parameter to ``1`` to tunnel all operations through the HTTP proxy | ||
(set with ``CURLOPT_PROXY``). | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_HTTPPROXYTUNNEL.html. | ||
|
||
``CONNECTTIMEOUT`` : *number* : **optional** = ``0`` (never times out) | ||
|
||
- The maximum time in seconds that the system will wait for a connection to be established. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_CONNECTTIMEOUT.html | ||
|
||
``TIMEOUT`` : *number* : **optional** = ``0`` | ||
|
||
- The maximum time in seconds that the system will wait for a response from the server. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_TIMEOUT.html | ||
|
||
``LOW_SPEED_LIMIT`` : *number* : **optional** = ``0`` | ||
|
||
- The transfer speed in bytes per second that the system considers too slow and will abort the transfer. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_LIMIT.html | ||
|
||
``LOW_SPEED_TIME`` : *number* : **optional** = ``0`` | ||
|
||
- The time in seconds that the system will wait for the transfer speed to be above the limit | ||
set by ``LOW_SPEED_LIMIT``. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_TIME.html | ||
|
||
``ASYNC_PROC_LIMIT`` : *number* : **optional** = ``0`` | ||
|
||
- The maximum concurrency limit for the number of contributions to be processed in a scope of | ||
the database. The actual number of parallel requests may be further lowered by the hard limit specified by | ||
the Replication System worker's configuration parameter (``worker``, ``num-async-loader-processing-threads``). | ||
The parameter can be adjusted in real time as needed. It gets into effect immediately. Putting ``0`` as a value of | ||
the parameter will effectively turn this option off as if it has never been configured for the database. | ||
|
||
This attribute directly maps to https://curl.se/libcurl/c/CURLOPT_LOW_SPEED_TIME.html | ||
|
||
**Note**: The parameter is available as of API version ``14``. | ||
|
||
If a request is successfully finished it returns the standard JSON object w/o any additional data but | ||
the standard completion status. | ||
|
||
.. _ingest-config-get: | ||
|
||
Retrieving configuration parameters | ||
----------------------------------- | ||
|
||
.. warning:: | ||
As of version ``14`` of the API, the name of the database is required to be passed in the request's query instead of | ||
passing it in the JSON body. The older implementation was wrong. | ||
|
||
|
||
.. list-table:: | ||
:widths: 10 25 65 | ||
:header-rows: 1 | ||
|
||
* - method | ||
- service | ||
- query parameters | ||
* - ``GET`` | ||
- ``/ingest/config`` | ||
- ``database=<string>`` | ||
|
||
Where the mandatory query parameter ``database`` specifies the name of a database affected by the operation. | ||
|
||
If the operation is successfully finished it returns an extended JSON object that has the following schema (in addition | ||
to the standard status and error reporting attributes): | ||
|
||
.. code-block:: | ||
{ "database" : <string>, | ||
"SSL_VERIFYHOST" : <number>, | ||
"SSL_VERIFYPEER" : <number>, | ||
"CAPATH" : <string>, | ||
"CAINFO" : <string>, | ||
"CAINFO_VAL" : <string>, | ||
"PROXY_SSL_VERIFYHOST" : <number>, | ||
"PROXY_SSL_VERIFYPEER" : <number>, | ||
"PROXY_CAPATH" : <string>, | ||
"PROXY_CAINFO" : <string>, | ||
"PROXY_CAINFO_VAL" : <string>, | ||
"CURLOPT_PROXY" : <string>, | ||
"CURLOPT_NOPROXY" : <string>, | ||
"CURLOPT_HTTPPROXYTUNNEL" : <number>, | ||
"CONNECTTIMEOUT" : <number>, | ||
"TIMEOUT" : <number>, | ||
"LOW_SPEED_LIMIT" : <number>, | ||
"LOW_SPEED_TIME" : <number>, | ||
"ASYNC_PROC_LIMIT" : <number> | ||
} | ||
The attributes of the response object are the same as the ones described in the section :ref:`ingest-config-set`. |
Oops, something went wrong.