Added the Ingest Workflow Developer's Guide

iagaponenko committed Oct 11, 2024
1 parent ec926af commit 9c5d2ab
Showing 29 changed files with 4,407 additions and 117 deletions.
Binary file added doc/_static/ingest-transaction-fsm.png
Binary file added doc/_static/ingest-transaction-fsm.pptx
Binary file not shown.
Binary file added doc/_static/ingest-transactions.png
Binary file added doc/_static/ingest-transactions.pptx
Binary file not shown.
1 change: 1 addition & 0 deletions doc/conf.py
@@ -42,6 +42,7 @@
r"^https://rubinobs.atlassian.net/wiki/",
r"^https://rubinobs.atlassian.net/browse/",
r"^https://www.slac.stanford.edu/",
r"^/_images/",
]

html_additional_pages = {
7 changes: 7 additions & 0 deletions doc/ingest/api/advanced/index.rst
@@ -0,0 +1,7 @@

.. _ingest-api-advanced:

==================
Advanced Scenarios
==================

265 changes: 265 additions & 0 deletions doc/ingest/api/appendix/index.rst
@@ -0,0 +1,265 @@
.. _ingest-api-appendix:

========
APPENDIX
========

.. _ingest-api-appendix-indexes:

Managing indexes of MySQL tables at Qserv workers
-------------------------------------------------

.. _ingest-api-appendix-partitioning:

Partitioning data of the RefMatch tables
-----------------------------------------

Introduction
^^^^^^^^^^^^

The ``RefMatch`` tables are a special class of tables designed to match rows between two independent *director* tables
belonging to the same or different catalogs. In this case, there is no obvious 1-to-1 match between rows of the director tables.
Instead, the pipelines compute a (spatial) match table between the two that provides a many-to-many relationship between them.
The complication is that a match record might reference an *object* row and a referenced row that fall on opposite sides of
a partition boundary (in different chunks). Qserv deals with this by taking advantage of the overlap that must be stored
alongside each partition (this overlap is stored so that Qserv can avoid inter-worker communication when performing
spatial joins on the fly).

Since ``RefMatch`` tables are also partitioned tables, their input data (CSV) have to be partitioned into chunks.
Partitioning RefMatch tables requires a special version of the partitioning tool, ``sph-partition-matches``.
The source code of the tool is found in the Qserv source tree: https://github.com/lsst/qserv/blob/main/src/partition/sph-partition-matches.cc.
The corresponding binary is built and placed into the binary Docker image of Qserv.

Here is an example illustrating how to launch the tool from the container:

.. code-block:: bash

   % docker run -it qserv/lite-qserv:2022.9.1-rc1 sph-partition-matches --help
   sph-partition-matches [options]

   The match partitioner partitions one or more input CSV files in
   preparation for loading by database worker nodes. This involves assigning
   both positions in a match pair to a location in a 2-level subdivision
   scheme, where a location consists of a chunk and sub-chunk ID, and
   outputting the match pair once for each distinct location. Match pairs
   are bucket-sorted by chunk ID, resulting in chunk files that can then
   be distributed to worker nodes for loading.

   A partitioned data-set can be built-up incrementally by running the
   partitioner with disjoint input file sets and the same output directory.
   Beware - the output CSV format, partitioning parameters, and worker
   node count MUST be identical between runs. Additionally, only one
   partitioner process should write to a given output directory at a
   time. If any of these conditions are not met, then the resulting
   chunk files will be corrupt and/or useless.

   \_____________________ Common:
     -h [ --help ]            Demystify program usage.
     -v [ --verbose ]         Chatty output.
     -c [ --config-file ] arg The name of a configuration file
                              containing program option values in a
     ...

The tool has two parameters specifying the locations of the input (CSV) file and the output folder where
the partitioned products will be stored:

.. code-block:: bash

   % sph-partition-matches --help
   ...
   \_____________________ Output:
     --out.dir arg            The directory to write output files to.
   \______________________ Input:
     -i [ --in.path ] arg     An input file or directory name. If the
                              name identifies a directory, then all
                              the files and symbolic links to files
                              in the directory are treated as inputs.
                              This option must be specified at least
                              once.

.. hint::

   If the tool is launched via ``docker`` as shown above, the corresponding host paths (the locations
   of the input files and of the output folder) have to be mounted into the container.
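
For example, here is a minimal sketch of such an invocation, assuming a hypothetical host directory
``/data/refmatch`` that holds the configuration, the input file, and the output folder:

.. code-block:: bash

   % docker run -it \
       -v /data/refmatch:/data/refmatch \
       qserv/lite-qserv:2022.9.1-rc1 \
       sph-partition-matches \
       -c /data/refmatch/config.json \
       --in.path=/data/refmatch/in.csv \
       --out.dir=/data/refmatch/chunks/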

All tables, including both *director* tables and the ``RefMatch`` table itself, have to be partitioned using
the same values of the partitioning parameters, including:

- The number of stripes
- The number of sub-stripes
- The overlap radius

Values of the partitioning parameters should be specified using the following options (the default values shown below are meaningless
for any production scenario):

.. code-block:: bash

   --part.num-stripes arg (=18)      The number of latitude angle stripes to
                                     divide the sky into.
   --part.num-sub-stripes arg (=100) The number of sub-stripes to divide
                                     each stripe into.
   --part.overlap arg (=0.01)        Chunk/sub-chunk overlap radius (deg).

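For instance, here is a sketch showing only how these three options could be passed on the command line
(the values match the configuration file used in the complete example below; a real run also needs the
input, output, and position options discussed in the next sections):

.. code-block:: bash

   % sph-partition-matches \
       --part.num-stripes=340 \
       --part.num-sub-stripes=3 \
       --part.overlap=0.01 \
       ...
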
The next sections present two options for partitioning the input data.

The spatial match within the given overlap radius
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is the most reliable way of partitioning the input data of the match tables. It is available when
the input rows of the match table carry the exact spatial coordinates of both matched rows (from the corresponding
*director* tables).

In this scenario, the input data file (``CSV``) is expected to have 4 columns representing the spatial coordinates
of the matched rows from the 1st ("left") and the 2nd ("right") *director* tables. Roles and sample names
of the columns are presented in the table below:

.. list-table::
   :widths: 70 30
   :header-rows: 1

   * - role
     - column name

   * - The *right ascension* coordinate (*longitude*) of the 1st matched entity (from the 1st *director* table).
     - ``dir1_ra``

   * - The *declination* coordinate (*latitude*) of the 1st matched entity (from the 1st *director* table).
     - ``dir1_dec``

   * - The *right ascension* coordinate (*longitude*) of the 2nd matched entity (from the 2nd *director* table).
     - ``dir2_ra``

   * - The *declination* coordinate (*latitude*) of the 2nd matched entity (from the 2nd *director* table).
     - ``dir2_dec``

The names of these columns need to be passed to the partitioning tool using two special parameters:

.. code-block:: bash

   % sph-partition-matches \
       --part.pos1="dir1_ra,dir1_dec" \
       --part.pos2="dir2_ra,dir2_dec"

.. note::

   The order of the columns in each packed pair of columns is important. The names must be separated by commas.

When using this technique for partitioning the match tables, the input CSV file(s) must have at least the 4 columns
mentioned above. The actual number of columns could be larger. Values of all additional columns will be copied into
the partitioned products (the chunk files). The original order of the columns will be preserved.

Here is an example of a sample ``CSV`` file that has values of the above-described spatial coordinates in the first 4 columns
and the object identifiers of the corresponding rows from the matched *director* tables in the last 2 columns:

.. code-block::

   10.101,43.021,10.101,43.021,123456,6788404040
   10.101,43.021,10.102,43.023,123456,6788404041

The last two columns are meant to store values of the following columns:

.. list-table::
   :widths: 70 30
   :header-rows: 1

   * - role
     - column name

   * - The unique object identifier of the 1st *director* table.
     - ``dir1_objectId``

   * - The unique object identifier of the 2nd *director* table.
     - ``dir2_objectId``

The input CSV file shown above can also be presented in tabular form:

.. list-table:: Data Table
   :widths: 10 10 10 10 10 10
   :header-rows: 1

   * - dir1_ra
     - dir1_dec
     - dir2_ra
     - dir2_dec
     - dir1_objectId
     - dir2_objectId
   * - 10.101
     - 43.021
     - 10.101
     - 43.021
     - 123456
     - 6788404040
   * - 10.101
     - 43.021
     - 10.102
     - 43.023
     - 123456
     - 6788404041

Note that this is actually a 1-to-2 match, in which a single object (``123456``) of the 1st director has two matched
objects (``6788404040`` and ``6788404041``) in the 2nd director. Also note that the second matched object has slightly
different spatial coordinates from the first one (the differences are ``0.001`` deg in *ra* and ``0.002`` deg in *dec*).
If the value of the overlap parameter is bigger than the difference between the coordinates, the tool will be able to
match the objects successfully. For example, this would work with the overlap set to ``0.01``. Otherwise, no match will
be made and the row will be ignored by the tool.

.. warning::

   It is assumed that the input data of the ``RefMatch`` tables are correctly produced by the data processing
   pipelines. Verifying the quality of the input data is beyond the scope of this document. However, one might
   consider writing a special tool for pre-scanning the input files and finding problems in the files.
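
As an illustration, here is a minimal sketch of such a pre-scan (an assumption of this guide, not a part of the
Ingest system) that validates the 6-column layout used in the examples of this section:

.. code-block:: bash

   # Flag rows that have a wrong number of fields, or whose coordinate
   # columns fall outside the valid ranges: ra in [0, 360), dec in [-90, 90].
   awk -F',' '
       NF != 6                                    { print NR ": bad field count"; next }
       $1 < 0 || $1 >= 360 || $3 < 0 || $3 >= 360 { print NR ": ra out of range" }
       $2 < -90 || $2 > 90 || $4 < -90 || $4 > 90 { print NR ": dec out of range" }' in.csv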

Here is a complete practical example of how to run the tool under the assumptions made above:

.. code-block:: bash

   % cat in.csv
   10.101,43.021,10.101,43.021,123456,6788404040
   10.101,43.021,10.102,43.023,123456,6788404041

   % cat config.json
   {
       "part": {
           "num-stripes": 340,
           "num-sub-stripes": 3,
           "overlap": 0.01,
           "pos1": "dir1_ra,dir1_dec",
           "pos2": "dir2_ra,dir2_dec"
       },
       "in": {
           "csv": {
               "null": "\\N",
               "delimiter": ",",
               "field": [
                   "dir1_ra",
                   "dir1_dec",
                   "dir2_ra",
                   "dir2_dec",
                   "dir1_objectId",
                   "dir2_objectId"
               ]
           }
       },
       "out": {
           "csv": {
               "null": "\\N",
               "delimiter": ",",
               "escape": "\\",
               "no-quote": true
           }
       }
   }

   % mkdir chunks
   % sph-partition-matches -c config.json --in.path=in.csv --out.dir=chunks/
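
If the run succeeds, the output folder will be populated with the partitioned products, one CSV file per
generated chunk (the exact naming of the chunk files depends on the version and the configuration of the tool):

.. code-block:: bash

   % ls chunks/
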
Partitioning using index maps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. note::

   This section is under construction. Only the basic idea is presented here.

This is an alternative way of partitioning the input data of the match tables. It is available when the input rows
of the match table do not carry the exact spatial coordinates of both matched rows (from the corresponding *director*
tables) but do carry the unique object identifiers of the matched rows. The tool uses the *index maps* of the
*director* tables to look up the spatial coordinates of the matched rows by their object identifiers.
