Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New dicom archive #1117

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

maximemulder
Copy link
Contributor

@maximemulder maximemulder commented Apr 25, 2024

Description

This PR is a rewrite (and cleaning/modernization) of dicom-archive/dicomTar.pl, dicom-archive/dicomSummary.pl and dicom-archive/DICOM/DCMSUM.pm in Python, which intends to replace them in the long term. This PR does not replace them for now as we should do more testing and integrate these scripts with the rest of the pipeline before doing so, likely in the next LORIS version imo.

Presentation

This PR adds three Python scripts:

  • dicom_summary.py, which is the port of dicomSummary.pl and generates a textual summary of a DICOM directory (surprise !).
  • dicom_archive.py, which is the port of dicomTar.pl, creates a DICOM directory archive for and (optionally) inserts or updates its information in the LORIS database.

To support these scripts, the python/lib/dicom/ directory was added, which contains various modules used by the aforementioned scripts.

  • The summary_*.py files contain code for the DICOM summary.
  • The dicom_log.py file contains code for the DICOM archiving log.
  • The text_*.py files contains text utilities code used by the two previous points.
  • The dicom_database.py file contains the code used to insert a DICOM directory's information in the database.

Breaking changes

I removed the following options from dicomTar.pl / dicom_archive.py, as they did not seem to be used by any other script:

  • -centerName, which fills the tarchive.neurodbCenterName field in the database, but seems unused in LORIS and LORIS-MRI.
  • -mri_upload_update, which launchs the updateMRI_Upload.pl script after dicomTar.pl and also seems unused (but less confident about that one).
  • I also think splitting -clobber into --db-update (to update instead of inserting in the database) and --owerwrite (to replace files on the disk if they exist) might be a good idea.
  • I also added the --year option, which adds the created DICOM archive in a dedicated year subdirectory (according to the scan date), which is convenient for me at the BIC.

I did not port the tools/addSeriesAndFileRecords.pl script as it seems to not be used in LORIS / LORIS-MRI, and should maybe be deleted in the long term. It is however very easy to re-code with the new architecture (it is currently already a simple script).

PLEASE TELL ME IF ANYONE HAS A PROBLEM WITH THESE CHANGES. Other suggestions are also welcome.

Remaining work

Before merging:

  • Add a markdown documentation for the scripts.
  • Do more testing.
  • More exhaustive error handling.

Longer term work:

  • Integrate the new scripts with the existing pipelines and remove the old ones.
  • Clean up the tarchive table (remove the neurodbCenterName / LastUpdated fields ? make CenterName nullable ?).

@maximemulder maximemulder marked this pull request as draft April 25, 2024 16:15
@maximemulder
Copy link
Contributor Author

Okay, I think the PR is finished, I am happy with the code and tested it on a few DICOMs on my side.

The tests are failing because I am using a few match statements, which were added in Python 3.10. So I'd like to bump the Python minimum supported version to 3.10, if not, I can remove them (but obviously prefer the newer and shinier version).

We should also test it on DICOMs from other projects. Then, when this PR is merged, I can make another PR to adopt it in the different superscripts, although a problem will be that this script requires the Python profile file, while the scripts that call dicomTar.pl use the Perl one (so both will probably need to be passed as arguments).

I will note a few things on the code:

  • I did not use our custom argument parser because there is now argparse in the standard Python library, which handles all the cases we want. We probably want to transition to that long term.
  • I also wrote a few database helpers for my code, but I think in the long term we should transition to a database and query abstraction library, like sqlalchemy which I've heard about.

@maximemulder maximemulder marked this pull request as ready for review May 21, 2024 21:15
@maximemulder
Copy link
Contributor Author

maximemulder commented Jul 4, 2024

Blocked on Python 3.10, as well as #1141 and #1142 to see if this PR should be updated with new libraries.

NOTE for myself: Bring a fix from the new-mri-upload branch before merging (I don't remember which one but it should be easy) and do not initiate the database connection if the profile is not supplied.

Copy link
Collaborator

@cmadjar cmadjar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maximemulder I read through the code and have a few comments/refactoring suggestions. Let me know your thoughts.

Note: I did not test yet, just went through the code.

Comment on lines +21 to +31
def print_error_exit(message: str, code: int):
print(f'ERROR: {message}', file=sys.stderr)
sys.exit(code)


def print_warning(message: str):
print(f'WARNING: {message}', file=sys.stderr)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you check the log.py library for the messaging?



# Modified version of 'lorisgetopt.load_config_file'.
# We use argparse to parse the command line options in this script,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be best to use the lorisgetopt.py library to parse arguments since it checks and has a few other loris specific functionalities. Also, it avoids the duplicated code for load_config_file.py. It also avoids having to add another dependency.

Copy link
Contributor Author

@maximemulder maximemulder Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not a fan of lorisgetopt.py design, but you are probably right that it is best to avoid duplication, I'll do it. Also, argparse is part of the Python standard library, so I would not consider it another dependency.

@@ -0,0 +1,231 @@
from datetime import datetime
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should go under the database_lib directory. This is where we store the DB interaction files. The files are organized by table names and holds the table specific queries. It would be best to keep that organization.

if db is not None:
print('Checking database presence')

archive = lib.dicom.dicom_database.get_archive_with_study_uid(db, summary.info.study_uid)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe rename to db_archive to make is clear that this is the archive found in the DB?

Suggested change
archive = lib.dicom.dicom_database.get_archive_with_study_uid(db, summary.info.study_uid)
db_archive = lib.dicom.dicom_database.get_archive_with_study_uid(db, summary.info.study_uid)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, done !


print('Calculating DICOM tar MD5 sum')

tarball_md5_sum = lib.dicom.text.make_hash(tar_path, True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use the function compute_md5_hash in lib.utilities to compute the hash.

Copy link
Contributor Author

@maximemulder maximemulder Aug 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to do that and encountered two problems:

  • The behaviour of dicomTar.pl is to not only write the hash, but also the filename ("filename hash"). make_hash emulates this behaviour but not compute_md5_hash (although I would not be against writing just the hash).
  • compute_md5_hash also works on non-existing files, while make_hash does not. This makes typing (rightfully) unhappy because I need a str, not a str | None (compute_md5_hash docstring is incorrect with regards to its real return type).

from lib.dicom.text_dict import DictWriter


class Log:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is already a class Log that exists. I would rename this one DicomLog instead to make it clear this is for the DICOM log file.

"""
Order comparison between two nullable integers.
"""
match a, b:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After our LORIS-MRI call, I checked with the install instructions and unfortunately, there might be some instances where earlier versions of python might be installed as the default. Two solutions:

  1. modify the install instructions to make sure that users always use python >3.10
  2. change the code so that it works with earlier versions of python

Copy link
Contributor Author

@maximemulder maximemulder Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with both options but have a preference for updating the docs for Python 3.10+. It seems that 3.8 should (theoretically) become EOL soon, so we might as well do the transition imo.

https://devguide.python.org/versions/

Comment on lines +36 to +45
case _, None:
return -1
case None, _:
return 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand what the _ stands for and why the returns here are -1 or 1

Copy link
Contributor Author

@maximemulder maximemulder Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ is the wildcard pattern, which kind of means "anything works" (so any Some(x) such as Some(0), Some(1)... work here, since None was already matched before).
The returns -1, 1... are because this is a comparison function, a negative number means that a < b, 0 that a == b, and a positive number that a > b. This is required by cmp_to_key in summary_make.py but I'll add some documentation.

Comment on lines 82 to 94
def get_dir_files(prefix: str, path: str) -> list[str]:
"""
Recursively get the files of a directory.
"""
if os.path.isdir(prefix + '/' + path):
files = []
for file in os.listdir(prefix + '/' + path):
files += get_dir_files(prefix, path + '/' + file)

# Flatten the lists of files
return files

return [path]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function could be moved to lib.utilities and could be used by other processes since it is not specific to DICOMs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done !


with open(tar_path, 'rb') as tar:
with gzip.open(zip_path, 'wb') as zip:
shutil.copyfileobj(tar, zip)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could copy_file from lib.utilities be used here? Maybe not since you are using copyfileobj. Just wondering

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, the code here gzips the copied file, which the copyfile does not do.

@maximemulder
Copy link
Contributor Author

Thanks for the review @cmadjar ! I wrote some comments on a few points. Points with no comments mean that I agree or will look into it.

@cmadjar
Copy link
Collaborator

cmadjar commented Aug 1, 2024

@maximemulder I think if you rebase this one, it will pass the flake8 now that your PR with the python install update has been merged. Thanks!

Also, regarding the -mri_upload_update, this option is used by projects that do not use the imaging uploader to upload their scan and run directly dicomTar.pl. I think we should still support that functionality as some projects might still be using the old way of importing DICOMs. Otherwise, the script to create the MINC or BIDS dataset would break because there is no entry in the mri_upload table. Actually, before you do this change, could you bring it up during the next imaging call? It should be next week on August 8th. Thank you :)

@maximemulder
Copy link
Contributor Author

Ok I replied to all your (very good !) comments @cmadjar. The only ones I have intentionally left out are these ones :

  1. Move SQL queries to database_lib (waiting on a decision on SQLAlchemy at the next imaging meeting).
  2. Add the --db-upload parameter (done for a while on my SQLAlchemy branch)
  3. Use log.py (full implementation depends on --db-upload)

If you like the rest of the PR, I can add these very quickly after the next imaging meeting.

@laemtl
Copy link
Contributor

laemtl commented Sep 13, 2024

Waiting for the first test implementation to polish it.

@laemtl laemtl added A-CI Area: CI. Issues and pull requests related to CI, including automated tests and static checks Needs Work Before Merging and removed A-CI Area: CI. Issues and pull requests related to CI, including automated tests and static checks labels Sep 13, 2024
@maximemulder maximemulder marked this pull request as draft September 13, 2024 14:39
@maximemulder maximemulder force-pushed the rewrite-dicom-archive branch 10 times, most recently from 2e7b3c4 to 3503263 Compare September 24, 2024 18:43
@maximemulder maximemulder added the S-Large Size: Large. Pull requests that are large or complex and require a thorough review label Oct 4, 2024
@maximemulder maximemulder added the A-DICOM Area: DICOM. Issues and pull requests related to DICOMs and the DICOM archive ingestion label Oct 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-DICOM Area: DICOM. Issues and pull requests related to DICOMs and the DICOM archive ingestion Needs Work Before Merging S-Large Size: Large. Pull requests that are large or complex and require a thorough review
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants