Skip to content

Commit

Permalink
Add user guide for new clustering options
Browse files Browse the repository at this point in the history
  • Loading branch information
jbeilstenedmands committed Mar 27, 2024
1 parent 983a8a2 commit c92339e
Show file tree
Hide file tree
Showing 3 changed files with 84 additions and 0 deletions.
Binary file added doc/sphinx/figures/multiplex-dendrogram.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
1 change: 1 addition & 0 deletions doc/sphinx/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@
introductory_example
insulin_tutorial
Multi-crystal data <multi_crystal>
Multi-crystal data reduction (xia2.multiplex) <multiplex-multi-crystal>
Serial-crystallography <serial_crystallography>
program_output
parameters
Expand Down
83 changes: 83 additions & 0 deletions doc/sphinx/multiplex-multi-crystal.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,83 @@
++++++++++++++++++++++++++++++++++++++++++++++++
Multi-crystal data reduction with xia2.multiplex
++++++++++++++++++++++++++++++++++++++++++++++++

xia2.multiplex is a DIALS-based data reduction pipeline for combining integrated data from hundreds of
small-wedge rotation datasets. The input to the pipeline is DIALS integrated datafiles
(i.e. ``integrated.expt`` and ``integrated.refl`` files). xia2.multiplex performs the following routine: unit cell filtering, laue group analysis, unit cell
refinement, scaling, resolution analysis, space group analysis and merging. Additional non-isomorphism analysis is peformed and
dataset statistics and clustering are presented in the ``xia2.multiplex.html`` report.
For full details, see the publication at https://doi.org/10.1107/S2059798322004399 .

Although xia2.multiplex will automatically determine the resolution and space group, these can be manually overridden by setting the options
``resolution.d_min`` and ``symmetry.space_group``.

One feature of xia2.multiplex is the ability to optionally trigger further processing of subsets of the data.
This guide provides an updated description of the different clustering options, which have been updated to use more intuitively named
options in xia2/DIALS versions greater than v3.18, as well as incorporating some additional features developed since this initial publication.

---------------------------------------
Scaling and merging of dataset clusters
---------------------------------------
*To trigger further scaling and merging of clusters, set the option*
``output_clusters=True``.

Datasets can be clustered on two metrics; correlation coefficient clustering, based on pairwise correlations between datasets, and 'cos-angle' clustering, based on angular separation of
datasets in the cosym coordinate space.
The clustering method is controlled by the parameter ``clustering.method``, which can be set to ``cos_angle``, ``correlation`` or ``cos_angle+correlation``.
For the method(s) chosen, a number of clusters up to ``max_output_clusters`` (default value 10) will be scaled and merged, if they meet the following criteria which are controlled via the program parameters:

1. They have at least a completeness of ``min_completeness`` (default value 0).
2. They have at least a multiplicity of ``min_multiplicity`` (default value 0).
3. The number of datasets in the cluster is at least ``min_cluster_size`` (default value 2)
4. The maximum height of the cluster (height being the value of the dendrogram height as shown in the xia2.multiplex clustering output) is below ``max_cluster_height`` (default value 100).

If the clustering method is set to ``cos_angle+correlation``, then the max height for the two clustering methods are controlled individually with the parameters ``max_cluster_height_cc`` and ``max_cluster_height_cos``.

The default ``max_cluster_height`` values are large, such that the output clusters will typically be the `n=` ``max_output_clusters`` largest clusters. The dendrogram height values will be dataset specific, to choose
a particular value one can look at the **Intensity Clustering** section of a ``xia2.multiplex.html`` report:

.. image:: figures/multiplex-dendrogram.png

As shown in this image, the height of a cluster is the point at which it separates into two smaller clusters in the dendrogram. So in this
example, the largest cluster splits at a height of 0.072, and these two clusters each split again around a height of 0.018.
So if clustering was run with ``max_output_clusters=2``, one would get the two largest clusters (clusters 13 and 14 here). But if the ``max_cluster_height`` were set to a value just below 0.018, one would obtain
the two smaller clusters with heights 0.0075 and 0.0032 (clusters 12 and 11).

-------------------------
Distinct cluster analysis
-------------------------
The analysis described above does not guarantee that clusters will not share some individual datasets; this is not necessarily a problem, a typical use case for xia2.multiplex is generating a number
of similar clusters containing similar datasets, to compare the effects of removing a number of outlier datasets on the overall statistics.
An alternative use case is where one is interested in distinct clusters of datasets, i.e. clusters that do not share any individual datasets, which may correspond to distinct crystal strutures
(this is the kind of clustering structure demonstrated in the image above). While this can be handled using the options above for small or simple cases, for large datasets the dendrogram structure
becomes very complex, with many dendrogram subtructures and branches. As such, the options above do not guarantee selection and evaluation of distinct clusters in a timely manner.

To generate output containing distinct clusters `instead` of the output above, one can use the option ``clustering.analysis=True``.
In this case, individual clusters must still meet the four criteria above, then an analysis is performed to determine distinct clusters that do not share any individual datasets.
These clusters are then individually scaled and merged.

---------------------
Scaling and filtering
---------------------
An alternative means of filtering data within xia2.multiplex is to use :math:`{\Delta}CC1/2` filtering, as described in the publication.
This is separate to any cos-angle or correlation clustering analysis and is also optional, not being run by default.
To trigger this, use the option ``filtering.method=deltacchalf``. In this method, the :math:`{\Delta}CC1/2` is calculated for a group of images: groups with a :math:`{\Delta}CC1/2`
below ``deltacchalf.stdcutoff`` are removed (standard deviations below the mean, default value 4.0). A group is either a group of images or an individual dataset, the choice is made
with the parameter ``deltacchalf.mode=dataset`` or ``deltacchalf.mode=image_group`` (dataset is the default). If using the ``image_group`` mode, one must choose the number of images in each group with
the ``deltacchalf.group_size`` parameter (default value 10).
The filtering starts on the combined scaled dataset, and several cycles of repeated scaling and filtering are performed. This stops when one of the following criteria are met:

1. The number of cycles reaches ``deltacchalf.max_cycles`` (default 6).
2. The percentage of reflections removed exceeds ``deltacchalf.max_percent_removed`` (default 10).
3. The completess drops below ``deltacchalf.min_completeness`` (default 0).
4. No groups are removed in the latest cycle of filtering.

A merging statistics report for filtered dataset will be generated and displayed in the **Filtered** tab in the ``xia2.multiplex.html`` report.
Plots of changes in statistics during the scaling and filtering cycles can be found in the **Scaling and filtering plots** section in the **Summary** tab.

**Key points:**

- Turn on cycles of :math:`{\Delta}CC1/2` filtering + scaling with the option ``filtering.method=deltacchalf``.
- The ``deltacchalf.stdcutoff`` parameter is the main way to control the amount of data that is filtered out. Setting this to a lower number means that more data is filtered at each step.
- In the case of radiation damage towards the end of sweeps, it may be better to just exclude the end of sweeps rather than full sweeps; this is an ideal use case for the ``deltacchalf.mode=image_group`` option.

0 comments on commit c92339e

Please sign in to comment.