tickets/SP-1623: refactor metadata storage in the prenight scheduler simulation archive to improve performance #117

Open
wants to merge 7 commits into
base: main
50 changes: 50 additions & 0 deletions batch/compile_prenight_metadata_cache.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
#!/usr/bin/env bash
#SBATCH --account=rubin:developers # Account name
#SBATCH --job-name=auxtel_prenight_daily # Job name
#SBATCH --output=/sdf/data/rubin/user/neilsen/batch/auxtel_prenight_daily/daily.out # Output file (stdout)
#SBATCH --error=/sdf/data/rubin/user/neilsen/batch/auxtel_prenight_daily/daily.err # Error file (stderr)
#SBATCH --partition=milano # Partition (queue) names
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of tasks run in parallel
#SBATCH --cpus-per-task=1 # Number of CPUs per task
#SBATCH --mem=16G # Requested memory
#SBATCH --time=1:00:00 # Wall time (hh:mm:ss)

echo "******** START of compile_prenight_metadata_cache.sh **********"

# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

# SLAC S3DF - source all files under ~/.profile.d
if [[ -e ~/.profile.d && -n "$(ls -A ~/.profile.d/)" ]]; then
    source <(cat $(find -L ~/.profile.d -name '*.conf'))
fi

__conda_setup="$('/sdf/group/rubin/user/neilsen/mambaforge/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/sdf/group/rubin/user/neilsen/mambaforge/etc/profile.d/conda.sh" ]; then
        . "/sdf/group/rubin/user/neilsen/mambaforge/etc/profile.d/conda.sh"
    else
        export PATH="/sdf/group/rubin/user/neilsen/mambaforge/bin:$PATH"
    fi
fi
unset __conda_setup

if [ -f "/sdf/group/rubin/user/neilsen/mambaforge/etc/profile.d/mamba.sh" ]; then
    . "/sdf/group/rubin/user/neilsen/mambaforge/etc/profile.d/mamba.sh"
fi

mamba activate prenight
export AWS_PROFILE=prenight
WORK_DIR=$(date '+/sdf/data/rubin/user/neilsen/batch/compile_prenight_metadata/%Y-%m-%dT%H%M%S' --utc)
echo "Working in $WORK_DIR"
mkdir -p "${WORK_DIR}"
cd "${WORK_DIR}" || exit 1
# Record the environment inside the per-run working directory
printenv > env.out
compile_sim_archive_metadata_resource --append
echo "******* END of compile_prenight_metadata_cache.sh *********"

23 changes: 23 additions & 0 deletions docs/archive.rst
Original file line number Diff line number Diff line change
@@ -160,6 +160,29 @@ Optional (for debugging or speculative future uses only) file types listed above
``statistics``
Basic statistics for the visit database.

Metadata cache
--------------

Reading each ``sim_metadata.yaml`` individually when loading metadata for a large number of simulations can be slow.
Therefore, metadata for sets of simulations can be compiled into a ``compiled_metadata_cache.h5`` file.
This file stores four tables in HDF5 format: ``simulations``, ``files``, ``kwargs``, and ``tags``.
Each of these tables is indexed by the URI of a simulation.

The ``files`` table contains one column for each key in the ``files`` dictionary in the yaml metadata file for the simulation, providing the metadata needed to reconstruct this element of the dictionary.

The ``kwargs`` table contains one column for each key in the ``sim_runner_kwargs`` dictionary in the yaml metadata file for the simulation, providing the metadata needed to reconstruct this element of the dictionary.
If a keyword argument is not set, a ``numpy.nan`` value is stored in the table.

The ``tags`` table contains a single column, ``tag``, with one row for each tag in each simulation.

The ``simulations`` table contains one column for every other keyword found in the metadata yaml files.
If a keyword is not set for a simulation, a ``numpy.nan`` value is stored in the table.
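
The split into four tables described above can be illustrated with a minimal, standard-library-only sketch. It is not the ``rubin_scheduler`` implementation: the function names, the example URIs, and the metadata keys other than ``files``, ``sim_runner_kwargs``, and ``tags`` are hypothetical, the ``files`` values are simplified to plain strings, and ``math.nan`` stands in for ``numpy.nan``.

```python
import math

# Top-level metadata keys that get their own table.
RESERVED = ("files", "sim_runner_kwargs", "tags")


def split_metadata(metadata):
    """Split one simulation's metadata dict into its four parts."""
    return {
        "simulations": {k: v for k, v in metadata.items() if k not in RESERVED},
        "files": dict(metadata.get("files", {})),
        "kwargs": dict(metadata.get("sim_runner_kwargs", {})),
        "tags": list(metadata.get("tags", [])),
    }


def build_tables(metadata_by_uri):
    """Build URI-indexed tables, filling unset columns with NaN."""
    parts = {uri: split_metadata(md) for uri, md in metadata_by_uri.items()}
    tables = {}
    for name in ("simulations", "files", "kwargs"):
        # One column per key seen in any simulation.
        columns = sorted({key for part in parts.values() for key in part[name]})
        tables[name] = {
            uri: {col: part[name].get(col, math.nan) for col in columns}
            for uri, part in parts.items()
        }
    # The tags table has one (uri, tag) row per tag per simulation.
    tables["tags"] = [(uri, tag) for uri, part in parts.items() for tag in part["tags"]]
    return tables


# Hypothetical metadata for two simulations, keyed by made-up URIs.
example = {
    "s3://example-bucket/sim/1/": {
        "label": "baseline",
        "files": {"observations": "opsim.db"},
        "sim_runner_kwargs": {"record_rewards": True},
        "tags": ["ideal", "nominal"],
    },
    "s3://example-bucket/sim/2/": {
        "label": "test",
        "telescope": "simonyi",
        "files": {"observations": "opsim.db", "rewards": "rewards.h5"},
        "tags": ["nominal"],
    },
}
tables = build_tables(example)
```

In this sketch the second simulation has no ``sim_runner_kwargs``, so its row in the ``kwargs`` table holds NaN for every column, mirroring the fill behavior described for the cache.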

The ``compile_sim_archive_metadata_resource`` command in ``rubin_scheduler`` maintains the ``compiled_metadata_cache.h5`` file in an archive.
By default, it reads every ``sim_metadata.yaml`` file in the archive and builds a corresponding HDF5 cache file from scratch.
If called with the ``--append`` flag, it instead reads the existing metadata cache file, reads the ``sim_metadata.yaml`` files only for simulations added more recently than the last one in the existing cache, appends them to the cached results, and writes the result back to the cache.
The ``--append`` flag therefore speeds up the update considerably, but the cache will not reflect any changes to previously added simulations (including deletions).
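
The trade-off of appending can be seen in a small sketch. It assumes, purely for illustration, that each simulation carries an ``added`` date and that "more recently added" means a later date than the newest cached entry; the function name, the ``added`` field, and the URIs are hypothetical and may not match how ``compile_sim_archive_metadata_resource`` actually orders simulations.

```python
from datetime import date


def append_to_cache(cache, archive):
    """Return the cache plus archive entries newer than the newest cached one.

    Both arguments map a simulation URI to a metadata dict with an
    ``added`` date (an assumption made for this sketch).
    """
    if not cache:
        # Nothing cached yet: equivalent to building from scratch.
        return dict(archive)
    last_added = max(row["added"] for row in cache.values())
    updated = dict(cache)
    for uri, metadata in archive.items():
        # Older entries are never re-read, so edits to them are missed.
        if metadata["added"] > last_added:
            updated[uri] = metadata
    return updated


cache = {
    "s3://example-bucket/sim/a/": {"added": date(2025, 1, 1), "label": "old"},
    "s3://example-bucket/sim/b/": {"added": date(2025, 1, 2), "label": "b"},
}
# Since the cache was built: "a" was edited, "b" was deleted, "c" was added.
archive = {
    "s3://example-bucket/sim/a/": {"added": date(2025, 1, 1), "label": "edited"},
    "s3://example-bucket/sim/c/": {"added": date(2025, 1, 3), "label": "c"},
}
updated = append_to_cache(cache, archive)
```

Here ``updated`` gains the new simulation ``c``, but still holds the stale label for ``a`` and the deleted simulation ``b``. A full rebuild without ``--append`` is what clears such entries.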

Automatic archiving of generated data
-------------------------------------

1 change: 1 addition & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
@@ -56,6 +56,7 @@ dev = [
scheduler_download_data = "rubin_scheduler.data.scheduler_download_data:scheduler_download_data"
rs_download_sky = "rubin_scheduler.data.rs_download_sky:rs_download_sky"
archive_sim = "rubin_scheduler.sim_archive:make_sim_archive_cli"
compile_sim_archive_metadata_resource = "rubin_scheduler.sim_archive.sim_archive:compile_sim_archive_metadata_cli"
prenight_sim = "rubin_scheduler.sim_archive:prenight_sim_cli"
scheduler_snapshot = "rubin_scheduler.sim_archive:make_scheduler_snapshot_cli"
