diff --git a/README.md b/README.md index 407fc0a..55f82bf 100644 --- a/README.md +++ b/README.md @@ -153,91 +153,9 @@ concatenation is important. have been created previously in the output file and their names must be the same as the one used in '-k' option. -## Sample input and output files -* There are four sample input HDF5 files provided in folder `examples`. - + examples/sample_r11981_s06.gz - + examples/sample_r11981_s07.gz - + examples/sample_r11981_s08.gz - + examples/sample_r11981_s09.gz -* These sample files are previously compressed. Run command 'make' will - uncompress them into HDF5 files, or run command below to uncompress them. - ```console - % gzip -dc examples/sample_r11981_s06.gz > examples/sample_r11981_s06.h5 - % gzip -dc examples/sample_r11981_s07.gz > examples/sample_r11981_s07.h5 - % gzip -dc examples/sample_r11981_s08.gz > examples/sample_r11981_s08.h5 - % gzip -dc examples/sample_r11981_s09.gz > examples/sample_r11981_s09.h5 - ``` -* Sample run commands - ```console - % mpiexec -n 2 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 - % mpiexec -n 4 ./ph5_concat -i examples/sample_list.txt -o sample_output.h5 -k evt - ``` - The output shown on screen is stored in `examples/sample_stdout.txt`. -* Sample output files - + The output files from concatenating the 4 sample files are available in - `examples/sample_output.h5.gz` whose metadata dumped from command below is - also available in `examples/sample_output.metadata`. - ```console - % gzip -dc examples/sample_output.h5.gz > sample_output.h5 - % h5dump -Hp sample_output.h5 - ``` - -## An example timing output from a run on Cori using 128 MPI processes. - ```console - % srun -n 128 ./ph5_concat -i ./nd_list_128.txt -o /scratch1/FS_1M_128/nd_out.h5 -b 512 -k evt - - Number of input HDF5 files: 128 - Input directory name: /global/cscratch1/sd/wkliao/FS_1M_8 - Output file name: /global/cscratch1/sd/wkliao/FS_1M_128/nd_out.h5 - Output datasets are compressed with level 6 - Read metadata from input files takes 1.2776 seconds - Create output file + datasets takes 25.7466 seconds - Concatenating 1D datasets takes 158.8101 seconds - Writ partition key datasets takes 14.0372 seconds - Concatenating 2D datasets takes 114.4464 seconds - Close input files takes 0.0037 seconds - Close output files takes 0.4797 seconds - ------------------------------------------------------------- - Input directory name: /scratch/FS_1M_8 - Number of input HDF5 files: 128 - Output HDF5 file name: /scratch1/FS_1M_128/nd_out.h5 - Parallel I/O strategy: 2 - Use POSIX I/O to open file: ON - POSIX In-memory I/O: ON - 1-process-create-followed-by-all-open: OFF - Chunk caching for raw data: ON - GZIP level: 6 - Internal I/O buffer size: 512.0 MiB - Dataset used to produce partition key: evt - Name of partition key datasets: evt.seq - ------------------------------------------------------------- - Number of groups: 999 - Number of non-zero-sized groups: 108 - Number of groups have partition key: 108 - Total number of datasets: 17971 - Total number of non-zero datasets: 2795 - ------------------------------------------------------------- - Number of MPI processes: 128 - Number calls to MPI_Allreduce: 3 - Number calls to MPI_Exscan: 2 - ------------------------------------------------------------- - H5Dcreate: 25.6772 - H5Dread for 1D datasets: 1.8583 - H5Dwrite for 1D datasets: 170.2729 - H5Dread for 2D datasets: 19.9304 - H5Dwrite for 2D datasets: 93.9265 - H5Dclose for input datasets: 0.0737 - H5Dclose for output datasets: 0.0516 - ------------------------------------------------------------- - Read metadata from input files: 1.2782 - Create output file + datasets: 25.7466 - Concatenate small datasets: 158.8102 - Write to partition key datasets: 14.0372 - Concatenate large datasets: 114.4520 - Close input files: 0.0124 - Close output files: 0.4799 - End-to-end: 314.8095 - ``` +## Run example +An example run with small input files for illustration is available in +folder [examples](./examples). ## Publications * Sunwoo Lee, Kai-yuan Hou, Kewei Wang, Saba Sehrish, Marc Paterno, diff --git a/examples/README.md b/examples/README.md new file mode 100644 index 0000000..0ed72a7 --- /dev/null +++ b/examples/README.md @@ -0,0 +1,122 @@ +# Examples of running 'ph5concat' + +This folder contains examples of running `ph5concat`, including a set of small +input files, run commands, concatenated output file, and metadata of input and +out files. + +# Input files +* There are four input HDF5 files, each containing a small number of dataset + extracted from neutrino simulation data for illustrative purpose. + + sample_r11981_s06.gz + + sample_r11981_s07.gz + + sample_r11981_s08.gz + + sample_r11981_s09.gz + +* These input files are compressed. Run command 'make' will uncompress them + into HDF5 files. They can also be uncompressed manually by commands below. + ```console + % gzip -dc sample_r11981_s06.gz > sample_r11981_s06.h5 + % gzip -dc sample_r11981_s07.gz > sample_r11981_s07.h5 + % gzip -dc sample_r11981_s08.gz > sample_r11981_s08.h5 + % gzip -dc sample_r11981_s09.gz > sample_r11981_s09.h5 + ``` + +* Metadata of input files + * Metadata of individual files can be retrieved by running command 'h5ls -r' + or 'h5dump -H'. + * The metadata of files 'sample_r11981_s06.h5' and 'sample_r11981_s07.h5' is + shown below. +
+ +
+ * In these examples, each file contains 4 groups at the root level, namely + 'neutrino', 'rec.me.trkkalman', 'rec.training.cvnmaps', and 'spill'. The + number of groups and their names must be identical among all input files to + be concatenated. + * In an input file, the number of datasets in a group can be different from + another group. + * Given a group, the number of datasets and their names it contains must be + identical among all input files. However, the size of first dimension of + datasets can be different. + + +* Run commands + ```console + % mpiexec -n 2 ../ph5_concat -i sample_list.txt -o sample_output.h5 + % mpiexec -n 4 ../ph5_concat -i sample_list.txt -o sample_output.h5 -k evt + ``` + When completed, the output shown on screen is available in + [sample_stdout.txt](./sample_stdout.txt). + +* Concatenated output file + + The concatenated output file is provided in `sample_output.h5.gz`. + + The metadata retrieved from 'h5dump' command is also available in + `sample_output.metadata`. + ```console + % gzip -dc sample_output.h5.gz > sample_output.h5 + % h5dump -Hp sample_output.h5 + ``` + * A short version of the metadata is shown below. ++ +
+ +## An example output from a run concatenating 128 files +Below is an example timing output from a larger run on Cori using 128 MPI +processes. +```console + % srun -n 128 ./ph5_concat -i ./nd_list_128.txt -o /scratch1/FS_1M_128/nd_out.h5 -b 512 -k evt + + Number of input HDF5 files: 128 + Input directory name: /global/cscratch1/sd/wkliao/FS_1M_8 + Output file name: /global/cscratch1/sd/wkliao/FS_1M_128/nd_out.h5 + Output datasets are compressed with level 6 + Read metadata from input files takes 1.2776 seconds + Create output file + datasets takes 25.7466 seconds + Concatenating 1D datasets takes 158.8101 seconds + Write partition key datasets takes 14.0372 seconds + Concatenating 2D datasets takes 114.4464 seconds + Close input files takes 0.0037 seconds + Close output files takes 0.4797 seconds + ------------------------------------------------------------- + Input directory name: /scratch/FS_1M_8 + Number of input HDF5 files: 128 + Output HDF5 file name: /scratch1/FS_1M_128/nd_out.h5 + Parallel I/O strategy: 2 + Use POSIX I/O to open file: ON + POSIX In-memory I/O: ON + 1-process-create-followed-by-all-open: OFF + Chunk caching for raw data: ON + GZIP level: 6 + Internal I/O buffer size: 512.0 MiB + Dataset used to produce partition key: evt + Name of partition key datasets: evt.seq + ------------------------------------------------------------- + Number of groups: 999 + Number of non-zero-sized groups: 108 + Number of groups have partition key: 108 + Total number of datasets: 17971 + Total number of non-zero datasets: 2795 + ------------------------------------------------------------- + Number of MPI processes: 128 + Number calls to MPI_Allreduce: 3 + Number calls to MPI_Exscan: 2 + ------------------------------------------------------------- + H5Dcreate: 25.6772 + H5Dread for 1D datasets: 1.8583 + H5Dwrite for 1D datasets: 170.2729 + H5Dread for 2D datasets: 19.9304 + H5Dwrite for 2D datasets: 93.9265 + H5Dclose for input datasets: 0.0737 + H5Dclose for output datasets: 0.0516 + ------------------------------------------------------------- + Read metadata from input files: 1.2782 + Create output file + datasets: 25.7466 + Concatenate small datasets: 158.8102 + Write to partition key datasets: 14.0372 + Concatenate large datasets: 114.4520 + Close input files: 0.0124 + Close output files: 0.4799 + End-to-end: 314.8095 +``` + diff --git a/examples/concated.tiff b/examples/concated.tiff new file mode 100644 index 0000000..8391ef3 Binary files /dev/null and b/examples/concated.tiff differ diff --git a/examples/s06_s07.tiff b/examples/s06_s07.tiff new file mode 100644 index 0000000..d1aed53 Binary files /dev/null and b/examples/s06_s07.tiff differ diff --git a/examples/sample_stdout.txt b/examples/sample_stdout.txt index 0e5b884..8a40687 100644 --- a/examples/sample_stdout.txt +++ b/examples/sample_stdout.txt @@ -5,7 +5,7 @@ Output datasets are compressed with level 6 Read metadata from input files takes 0.0096 seconds Create output file + datasets takes 0.0149 seconds Concatenating 1D datasets takes 0.0480 seconds -Writ partition key datasets takes 0.0240 seconds +Write partition key datasets takes 0.0240 seconds Concatenating 2D datasets takes 0.0081 seconds Close input files takes 0.0005 seconds Close output files takes 0.0010 seconds