add snowstorm_dataset and IceCubehosted class #783

sevmag · 2025-01-27T16:29:10Z

Adds curated datasets hosted on the IceCube cluster (download requires a username and password).
Implements the curated, IceCube-hosted SnowStorm dataset.

RasmusOrsoe · 2025-01-31T09:13:59Z

src/graphnet/datasets/snowstorm_dataset.py

+        validation_dataloader_kwargs: Optional[Dict[str, Any]] = None,
+        test_dataloader_kwargs: Optional[Dict[str, Any]] = None,
+    ):
+        """Initialize SnowStorm dataset."""


I think we should explain the arguments here. Most can be repeated from the parent class but run_ids is new.

sevmag · 2025-02-06

Yes, I agree!
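For illustration, the expanded docstring could look roughly like the sketch below. The class name `SnowStormDatasetSketch` is a hypothetical stand-in, the `List[int]` type for `run_ids` is an assumption, and the argument descriptions are illustrative wording rather than text taken from the parent class.

```python
from typing import Any, Dict, List, Optional


class SnowStormDatasetSketch:
    """Hypothetical stand-in used only to show the docstring layout."""

    def __init__(
        self,
        run_ids: List[int],  # assumed type; the PR does not state it
        train_dataloader_kwargs: Optional[Dict[str, Any]] = None,
        validation_dataloader_kwargs: Optional[Dict[str, Any]] = None,
        test_dataloader_kwargs: Optional[Dict[str, Any]] = None,
    ):
        """Initialize SnowStorm dataset.

        Args:
            run_ids: Run ids of the SnowStorm sets to include. Only ids
                that exist on the IceCube cluster are valid.
            train_dataloader_kwargs: Keyword arguments for the training
                dataloader. Defaults to None.
            validation_dataloader_kwargs: Keyword arguments for the
                validation dataloader. Defaults to None.
            test_dataloader_kwargs: Keyword arguments for the test
                dataloader. Defaults to None.
        """
        self._run_ids = run_ids
        self._train_dataloader_kwargs = train_dataloader_kwargs
        self._validation_dataloader_kwargs = validation_dataloader_kwargs
        self._test_dataloader_kwargs = test_dataloader_kwargs
```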

RasmusOrsoe

Thank you for this very clean contribution @sevmag!

Before we proceed, do you have comments on this @Aske-Rosted? If you have a wiki for the dataset/conversion, maybe we should link to that instead of the generic snowstorm wiki?

RasmusOrsoe · 2025-01-31T09:16:37Z

src/graphnet/datasets/snowstorm_dataset.py

+            assert match
+            run_id = match.group(1)
+
+            query_df = query_database(


How fast is this query in your experience? If the database is large, it might take minutes to execute. In that case, we could consider providing a .parquet file with the ids in them.

sevmag · 2025-02-06

Yeah, you are right. It depends on the files, but some take over 2 minutes, and it's probably the query.
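The idea of shipping the ids in a file can be sketched as a query-once cache. To keep the sketch free of extra dependencies it stores the ids as JSON; a .parquet file written with pandas would play the same role. The helper name `load_event_ids`, the table name `truth`, and the column name `event_no` are assumptions for illustration, not names confirmed by this PR.

```python
import json
import os
import sqlite3
from typing import List


def load_event_ids(
    db_path: str, cache_path: str, table: str = "truth"
) -> List[int]:
    """Return event ids, reading them from a cache file when one exists."""
    if os.path.exists(cache_path):
        # Fast path: skip the database entirely.
        with open(cache_path) as f:
            return json.load(f)

    # Slow path: run the query once, then store the result for next time.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            f"SELECT event_no FROM {table} ORDER BY event_no"
        ).fetchall()
    finally:
        conn.close()

    ids = [row[0] for row in rows]
    with open(cache_path, "w") as f:
        json.dump(ids, f)
    return ids
```

With a cache like this, only the first initialization for a given file pays for the query; later ones read the stored ids.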

sevmag · 2025-02-11

UPDATE: The bottleneck when initializing this SnowStormDataset implementation is mainly the parent-class initialization called here (especially when selecting run IDs with many files):

graphnet/src/graphnet/data/curated_datamodule.py

Lines 76 to 85 in 79d7baf

        # Instantiate
        super().__init__(
            dataset_reference=dataset_ref,
            dataset_args=dataset_args,
            train_dataloader_kwargs=train_dataloader_kwargs,
            validation_dataloader_kwargs=validation_dataloader_kwargs,
            test_dataloader_kwargs=test_dataloader_kwargs,
            selection=selec,
            test_selection=test_selec,
        )

This is contrary to my earlier belief that the bottleneck was the prepare_args function in the snippet referenced above. I therefore decided to stick with this version, keeping in mind that the prepare_args function could still be made more efficient.
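A comparison like this can be made by timing each stage separately. The sketch below is a minimal way to do that with the standard library; the `timed` helper is hypothetical, and the two `sum(...)` calls are stand-ins for the real prepare_args and parent-initialization work, not the actual graphnet code.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, results: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings: Dict[str, float] = {}

with timed("prepare_args", timings):
    sum(range(10_000))  # stand-in for the prepare_args work

with timed("super().__init__", timings):
    sum(range(100_000))  # stand-in for the parent-class initialization

# Name of the stage that took longest in this run.
slowest = max(timings, key=timings.get)
print(slowest, timings)
```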

RasmusOrsoe · 2025-01-31T09:31:09Z

src/graphnet/datasets/snowstorm_dataset.py

+    """
+
+    _experiment = "IceCube SnowStorm dataset"
+    _creator = "Severin Magel"


Should probably mention @Aske-Rosted here :-)

sevmag · 2025-02-06

True! Changed it.

RasmusOrsoe · 2025-01-31T09:37:49Z

src/graphnet/datasets/snowstorm_dataset.py

+        """Initialize SnowStorm dataset."""
+        self._run_ids = run_ids
+        self._zipped_files = [
+            os.path.join(self._data_root_dir, f"{s}.tar.gz") for s in run_ids


I think there are currently 28 valid run IDs, which is a subset of the overall available sets. The current code assumes the user knows this. It would be wise to include an assertion that checks that the user-given IDs actually exist.
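A minimal sketch of such a check is shown below. The set `VALID_RUN_IDS` holds placeholder values only (the real list of valid ids is not reproduced here), and `check_run_ids` is a hypothetical helper name. It raises `ValueError` rather than using a bare `assert`, since assert statements are skipped when Python runs with `-O`.

```python
from typing import Iterable, List, Set

# Placeholder values for illustration; not the actual valid run ids.
VALID_RUN_IDS: Set[int] = {1, 2, 3}


def check_run_ids(
    run_ids: Iterable[int], valid: Set[int] = VALID_RUN_IDS
) -> List[int]:
    """Return run_ids as a list, raising if any id is not in `valid`."""
    run_ids = list(run_ids)
    invalid = sorted(set(run_ids) - valid)
    if invalid:
        raise ValueError(
            f"Unknown run ids {invalid}; valid ids are {sorted(valid)}."
        )
    return run_ids
```

Calling this at the top of `__init__`, before building the list of `.tar.gz` paths, would turn a typo in a run id into an immediate, readable error instead of a missing-file failure later on.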

add snowstorm_dataset and IceCubehosted class

c31582f

RasmusOrsoe reviewed Jan 31, 2025


runID assertion and PR suggestions

78843af
