update cmssw plots, add ttbar sample to valid, add multiparticlegun and vbf to training (#330)

* update cmssw plots, add ttbar sample to valid
* update validation notebook
* disable ray for now
* update README [skip ci]
* remove DQM part [skip ci]
jpata · Jun 13, 2024 · 98d59c2
1 parent 0791d61
Showing 14 changed files with 599 additions and 318 deletions.
diff --git a/README.md b/README.md
@@ -19,6 +19,13 @@ MLPF focuses on developing full event reconstruction based on computationally sc
   - results: https://doi.org/10.5281/zenodo.10567397
   - weights: https://huggingface.co/jpata/particleflow/tree/main/clic/clusters/v1.6
 
+### Open datasets:
+The following datasets are available to reproduce the studies. They include full Geant4 simulation and reconstruction based on the CLIC detector. We have no affiliation with the CLIC collaboration, therefore these datasets are to be used only for computational studies and come with no warranty.
+
+- MLPF-CLIC, raw data: https://zenodo.org/records/8260741 or https://www.coe-raise.eu/od-pfr
+- MLPF-CLIC, processed for ML, tracks and clusters: https://zenodo.org/records/8409592
+- MLPF-CLIC, processed for ML, tracks and hits: https://zenodo.org/records/8414225
+
 ## MLPF development in CMS
 
 <p float="left">

diff --git a/mlpf/data_cms/README.md b/mlpf/data_cms/README.md
@@ -93,6 +93,16 @@ vector<reco::PFCandidate>             "particleFlow"              ""
 To test MLPF on higher statistics, it's not practical to redo full reconstruction before the particle flow step.
 We can follow a similar logic as the PF validation, where only the relevant PF sequences are rerun.
 
+We use the following datasets for this:
+```
+/RelValQCD_FlatPt_15_3000HS_14/CMSSW_14_1_0_pre3-PU_140X_mcRun3_2024_realistic_v8_STD_2024_PU-v2/GEN-SIM-DIGI-RAW
+/RelValTTbar_14TeV/CMSSW_14_1_0_pre3-PU_140X_mcRun3_2024_realistic_v8_STD_2024_PU-v2/GEN-SIM-DIGI-RAW
+/RelValQQToHToTauTau_14TeV/CMSSW_14_1_0_pre3-PU_140X_mcRun3_2024_realistic_v8_STD_2024_PU-v2/GEN-SIM-DIGI-RAW
+/RelValSingleEFlatPt2To100/CMSSW_14_1_0_pre3-PU_140X_mcRun3_2024_realistic_v8_STD_2024_PU-v2/GEN-SIM-DIGI-RAW
+/RelValSingleGammaFlatPt8To150/CMSSW_14_1_0_pre3-PU_140X_mcRun3_2024_realistic_v8_STD_2024_PU-v2/GEN-SIM-DIGI-RAW
+/RelValSinglePiFlatPt0p7To10/CMSSW_14_1_0_pre3-PU_140X_mcRun3_2024_realistic_v8_STD_2024_PU-v2/GEN-SIM-DIGI-RAW
+```
+
 #### MINIAOD with PF and MLPF
 The PF validation workflows can be run using the scripts in
 ```
@@ -105,17 +115,5 @@ cd particleflow
 
 The MINIAOD output will be in `$CMSSW_BASE/out/QCD_PU_mlpf` and `$CMSSW_BASE/out/QCD_PU_pf`.
 
-#### DQM plots
-Now the MINIAOD output can be analyzed with the DQM and PF validation scripts:
-```
-./scripts/cmssw/run_dqm.sh $CMSSW_BASE/out
-```
-
-The outputs will be in:
-```
-ls plots
-```
-and can be displayed in a web browser.
-
 ## Generating MLPF training samples
 TODO.
diff --git a/mlpf/data_cms/check_file.py b/mlpf/data_cms/check_file.py
@@ -0,0 +1,8 @@
+import pickle
+import sys
+import bz2
+
+try:
+    data = pickle.load(bz2.BZ2File(sys.argv[1], "rb"), encoding="iso-8859-1")
+except Exception:
+    print(sys.argv[1])
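
The new `check_file.py` prints the path given on the command line whenever that file fails to load as a bz2-compressed pickle. Below is a minimal sketch of the same check wrapped in a reusable function, assuming the goal is to pick out unreadable files from a list of paths. The names `is_readable_pkl_bz2` and `find_bad_files` are hypothetical and are not part of this commit.

```python
import bz2
import pickle


def is_readable_pkl_bz2(path):
    """Return True if `path` loads as a bz2-compressed pickle, else False."""
    try:
        # The context manager closes the file handle even if unpickling fails.
        with bz2.BZ2File(path, "rb") as fi:
            # Same encoding argument as in check_file.py.
            pickle.load(fi, encoding="iso-8859-1")
    except Exception:
        # Missing file, corrupt bz2 stream, or truncated pickle all end up here.
        return False
    return True


def find_bad_files(paths):
    """Return the subset of `paths` that cannot be loaded, in input order."""
    return [p for p in paths if not is_readable_pkl_bz2(p)]
```

Returning a list rather than printing keeps the check usable from other scripts; a caller can still print each entry to reproduce the behaviour of the script in the diff.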
diff --git a/mlpf/heptfds/cms_pf/vbf.py b/mlpf/heptfds/cms_pf/vbf.py
@@ -21,9 +21,10 @@
 class CmsPfVbf(tfds.core.GeneratorBasedBuilder):
     """DatasetBuilder for cms_pf dataset."""
 
-    VERSION = tfds.core.Version("1.7.0")
+    VERSION = tfds.core.Version("1.7.1")
     RELEASE_NOTES = {
         "1.7.0": "Add cluster shape vars",
+        "1.7.1": "Increase stats to 400k events",
     }
     MANUAL_DOWNLOAD_INSTRUCTIONS = """
     rsync -r --progress lxplus.cern.ch:/eos/user/j/jpata/mlpf/tensorflow_datasets/cms/cms_pf_vbf ~/tensorflow_datasets/