Identify the correct primary vertex from reconstructed tracks at the CMS Experiment using simulated data. The goal is to use the tracks from simulated data to predicate the location of the primary vertex at which they originated. This is done by training a Graph Convolutional Neural Network (GCNN) to be rewarded when the predicated vertex position from the tracks is close to the true vertex position.
The data used is simulated data based on CMS data, created by ./utils/track_generator_clean.py
. The script generates simulated particle tracks based on known CMS physics distributions.
All data is in .zip format and needs to be unzipped first.
-
Input Processing:
- Takes input parameters like number of vertices and maximum tracks per vertex
- Reads ROOT file containing histograms with real physics distributions
- These distributions include: vertex Z positions, number of tracks per vertex, track pT and eta
-
Core Functionality:
- Generates simulated primary vertices with positions sampled from the input distributions
- For each vertex, creates a specific number of tracks with:
- Track parameters (
pT
,eta
) sampled from real physics distributions - Track uncertainties calculated based on eta (polar angle)
- Track
z0
positions modified by resolution effects
- Track parameters (
-
Output Generation:
- Creates JSON files containing the simulated track and vertex information
- The JSON filenames have the format
events_{nVertices}V_{nTracksMax}T_{numTests}.json
. - Each output file describes an event containing:
- List of vertices with their positions
- For each vertex, list of tracks with:
z0
position (track position at point of closest approach)dz0
(uncertainty onz0
)
The script effectively creates realistic test data that mirrors actual particle detector measurements, useful for developing and testing track reconstruction algorithms.
To run ./utils/track_generator_clean.py
, check the following arguments
Arguments | Type | Default | Description |
---|---|---|---|
nVertices |
Integer | 2 |
The number of primary vertices in the event |
nTracksMax |
Integer | 5 |
The maximum number of tracks per primary vertex |
numTests |
Integer | 1000 |
Number of tests to run |
inputdir |
String | "./datasets/" |
Where to find the vertexInfo.root file |
outputdir |
String | "./datasets/train/set1" |
Where to save the output JSON files |
and then open a terminal in the project directory and execute the following command
python ./utils/track_generator_clean.py --nVertices 2 --nTracksMax 5 --numTests 1000 --inputdir "./datasets/vertexInfo.root" --outputdir "./datasets/train/set1"
The data will be in the following tree directory
datasets/
├── test/
│ └── set1
└── train/
├── set0
└── set1
We use ./utils/track_generator_clean.py
to generate two different training sets:
-
The first set (
set0
) has 49,000 different samples with a varying number of vertices and number of tracks per vertex. Specifically, we run the generator fornVertices = [2, 3, 4, 5, 6, 7, 8] nTracksMax = [2, 3, 4, 5, 6, 7, 8] numTests = 1000
Here are the details of the data generated
-
The second set (
set0
) has 1,000 different samples with 60 vertices in each event and 10 tracks per vertex which is inspired by the data CMS produces, shown in the figure below. The following CMS diagram shows the mean number of interactions per crossing, which is essential the mean number of primary vertices per crossing.To get a close dataset, we run the generator for
nVertices = 60 nTracksMax = 10 numTests = 1000
for a total of 1,000 JSON files, having 60,000 vertices and 600,000 tracks in total.
Here are the details of the data generated
The pull distribution has a nice Gaussian form. For this specific dataset, the mean is and the standard deviation is.
Additionally, if we look closely, we can see the shortcomings of the track data generation: position and uncertainty are uncorrelated (probably not true) and uncertainty can take up only on a finite set of values.
The generated data files have the following structure
[
[vertex_position, # VERTEX 1
[
[track_z0, track_dz0_normalized], # TRACK 1
[track_z0, track_dz0_normalized], # TRACK 2
... # up to nTracksMax tracks
]
],
... # repeated nVertices times
]
To make running the generator easier, we run ./utils/run_track_generator.py
that executes the previous command multiple times for the different nVertices
and nTracksMax
. To do so, open the aforementioned file and edit the array of values and the save directory, then run
python ./utils/run_track_generator.py
To run the model on a test set which has realistic data, not only in values, but also in quantity, we generate another version of set1
.
python .\track_generator_clean.py --nVertices 60 --nTracksMax 10 --numTests 1 --output_dir "../datasets/test/set1"
The model directly predicts vertex dz0
) used for vertex position weighting.
The ./src/model.py
file is described as follows:
-
TrackDataset
: A custom PyTorch Geometric dataset class that- Takes a list of file paths containing track data as input
- Processes JSON files containing vertex and track information
- Creates graph objects for each vertex-track combination where:
- Nodes: Track parameters
[z0, dz0_normalized]
- Edges: Connections between tracks with features
[distance, uncertainty_product]
- Target: The vertex position
- Nodes: Track parameters
-
TrackConv
: A custom Graph Convolutional layer that- Inherits from PyTorch Geometric's MessagePassing class
- Uses mean aggregation
- Contains an MLP with:
- Linear layer (
input_channels + 2
->output_channels
) - Layer normalization
- GELU activation
- Dropout
- Linear layer (
- Has a residual connection
-
VertexGNN
: The main Graph Neural Network model with-
Initial node embedding layers:
- Two Linear layers with
hidden_dim
size - Layer normalization and GELU activation after each
- Dropout between layers
- Two Linear layers with
-
Multiple TrackConv layers (
num_layers
):- Each with
hidden_dim
size - Residual connections around each layer
- Layer normalization after each layer
- Each with
-
Final MLP for vertex prediction:
- Linear(
hidden_dim
->hidden_dim
) - Layer normalization + ReLU + Dropout
- Linear(
hidden_dim
->hidden_dim//2
) - Layer normalization + ReLU + Dropout
- Linear(
hidden_dim//2
->1
)
- Linear(
-
The model also includes a physics-based loss function that computes a chi-square value between the predicted vertex position and the track parameters.
This is a message-passing GNN architecture designed specifically for vertex finding, with node features representing track parameters and edge features capturing geometric relationships between tracks. The multiple layers allow the model to iteratively refine vertex position predictions through graph convolutions and message passing.
# Initialize:
1. Create vertex dataset from track files
- Load JSON files containing track parameters and vertex positions
- Create graph objects with:
* Nodes = tracks [z0, dz0_normalized]
* Edges = track pairs [distance, uncertainty_product]
* Target = vertex position
2. Split data:
- 70% train, 15% val, 15% test
- Create DataLoaders with batch_size=32 (user changeable)
3. Setup model and training:
- Initialize VertexGNN(hidden_dim=256, num_layers=5)
- Adam optimizer with lr=1e-4
- OneCycleLR scheduler
- Early stopping with patience=300
# Training loop:
for epoch in range(50):
# Training phase
model.train()
for batch in train_loader:
1. Forward pass:
- Get vertex predictions from VertexGNN
- Track parameters go through node embedding
- Multiple TrackConv layers with message passing
- Final MLP predicts vertex position
2. Compute losses:
- Primary loss: MAE between predicted and true vertices
- Physics loss: Chi-square between predicted vertex and track parameters
- Total loss = primary_loss + 0.1 * physics_loss
3. Optimization:
- Zero gradients
- Backpropagate total loss
- Optimizer step
- Update LR scheduler
# Validation phase
model.eval()
for batch in val_loader:
1. Get predictions:
- Forward pass without gradients
- Store predicted vertices
2. Calculate metrics:
- Validation loss
- Mean absolute error
- Accuracy at different tolerances (1%)
3. Checkpoint management:
- Save if validation loss improves
- Early stop if no improvement for patience epochs
# Final evaluation:
1. Load best model
2. Evaluate on test set:
- Calculate test loss and MAE
- Save final results
- Generate prediction plots
To run the training script, check the following arguments
Arguments | Type | Default | Description |
---|---|---|---|
dataset |
String | Required | The name of the training run |
data_dir |
String | Required | The directory of the JSON files generated by the track generator |
hidden_dim |
Integer | 256 |
The number of hidden dimensions to be added in the model architecture |
num_layers |
Integer | 6 |
The number of convolutional layers embedded in the model architecture |
dropout |
Integer | 0.05 |
The dropout rate |
epochs |
Integer | 100 |
The number of epochs |
batch_size |
Integer | 64 |
The batch size for the training loop |
learning_rate |
Integer | 1e-5 |
The maximum learning rate for the Adam optimizer (varies during training) |
weight_decay |
Integer | 1e-5 |
The decay rate of the learning rate |
patience |
Integer | 50 |
The number of epochs to stop training if a condition on the loss is met |
and then open a terminal in the project directory and execute the following command
python ./src/train.py --dataset INSERT_TITLE --data_dir INSERT_OUTPUT_DIR --hidden_dim 256 --num_layers 6 --dropout 0.05 --epochs 100 --batch_size 64 --learning_rate 0.00001
After running this for the parameters for the two datasets separately, we have
-
The first set (
set0
): The model is stored in./trained_models/set0_hd256_nl6_e100_bs64_lr1e-05_wd1e-05_d0.05.pt
. The validation scatter plot and training metric are given here. -
The second set (
set1
): The model is stored in./trained_models/set1_hd256_nl6_e100_bs64_lr1e-05_wd1e-05_d0.05.pt
. The validation scatter plot and training metric are given here.
As a remark, the points in the graph above are extremely dense since, if you recall, the training sets had a lot of vertices and tracks per vertex. Additionally, we can see how the density tapers off on the sides, which represents the Gaussian pull distribution.
To test our models, simply run
python ./src/test.py --model_dir ../trained_models --model_name set0_hd256_nl6_e100_bs64_lr1e-05_wd1e-05_d0.05 --data_dir ../datasets/test/set1 --data_set set1
and then the output directory will be in ./src/test/set0
, where set0
here refers to the dataset the model was tested on, not the test set name.
We take the two model files generated above, stored at ./trained_models/
and run the test dataset through them.
-
The first model trained on
set0
: The analysis of this model gives an average of 99.81% of predictions being within 1% error.test_loss: 0.13108545241355896 mae: 8.387040843200683 rmse: 0.0036350178997963667 accuracy: 0.9980833333333333 model_config: {'hidden_dim': 256, 'num_layers': 6, 'epochs': 100, 'batch_size': 64, 'learning_rate': '1e-05', 'weight_decay': '1e-05', 'dropout': 0.05}
-
The first model trained on
set1
: The analysis of this model gives an average of 97.25% of predictions being within 1% error.test_loss: 0.12783483794530232 mae: 8.179061130015056 rmse: 0.004267072770744562 accuracy: 0.9724833333333334 model_config: {'hidden_dim': 256, 'num_layers': 6, 'epochs': 100, 'batch_size': 64, 'learning_rate': '1e-05', 'weight_decay': '1e-05', 'dropout': 0.05}
In the folder ./deterministic_annealing
, the file DeterministicAnnealing.cc
is taken from Purdue-TOP/PrimaryVertexing but modified to our needs.
After running g++ DeterministicAnnealer.cc -o DeterministicAnnealing
to compile the C++ code into an .exe file, the program runs and outputs a serializableResponse.json
file, contained in ./deterministic_annealing
.
We will use serializableResponse.json
in addition to the generated serializedEvents.json
file from .utils/track_generator_clean.py
to compare how well this deterministic annealing algorithm performs. The reason is to have a "classical" solution to the problem of vertex assignment that we can use for comparison with our model's performance.
We now run ./deterministic_annealing/deterministic_annealer_analysis.py
to check the predictions from the deterministic annealer in comparison to the true values. This executable will create as many JSON files and plots as there are events. For example, here is what one looks like:
The analysis of the deterministic annealer gives an average of 97.25% of predictions being within 1% error.
As we require more accuracy, the machine learning model falls off faster than the deterministic annealer. Hence, model architecture should be worked on for the next version of this model. From that, we can say that there definitely is more work to be done, and those approaches include:
- Experimenting with different models, specifically, one suggestion might be a transformer encoder to decoder structure
- Using real data and fully incorporating the track information like
pT
,eta
,phi
,d0
, andz0
, within uncertainties - Using quantum annealing to benchmark
[1] https://phys.org/news/2024-09-lhc-luminosity.html [2] https://iopscience.iop.org/article/10.1088/1742-6596/110/9/092009/pdf [3] https://www.slideserve.com/dbernal/vertex-and-track-reconstruction-in-cms-powerpoint-ppt-presentation [4] https://opendata.cern.ch/ [5] https://www.lhc-closer.es/taking_a_closer_look_at_lhc/0.lhc_p_collisions [6] Pattern Recognition, Tracking and Vertex Reconstruction in Particle Detectors by Rudolf Frühwirth and Are Strandlie [7] https://www.youtube.com/watch?v=pLcVVwx4tcw [8] https://indico.cern.ch/event/870237/ [9] https://twiki.cern.ch/twiki/bin/view/CMSPublic/LumiPublicResults#Run_3_charts_of_luminosity