
Proposal for a new interface. #334

Merged
merged 29 commits into from
Jan 29, 2024
Conversation

DinoBektesevic
Member

See example usage in this gist.

Note that the functionality within the classes themselves is not necessarily correct. It was a lot of work, so some corners were cut.

  • Some classes are not named well; see DECamCPFits, for example. Rename them.
  • DECamCPFits also uses the astro_metadata_translator package but only works for DECam - no bueno. Either do the header translation yourself or make that class general enough to work for all instruments supported by the metadata translator.
  • Everything about how DECamCPFits works is completely made up and does not represent the data correctly. Fix how variance is calculated and how masks are created. Find datasets to test on, because that's the current difficulty.
  • Nearly all Standardizers except the ButlerStandardizer have made-up PSFs - figure out what to do with them.
  • Standardizers don't take in a config that sets how masks or PSFs are created.
  • Write tests: unit tests, CI tests, etc.
  • Look into moving the C++ code onto ndarray or something similar; the cost of copying the Python arrays is very large, and if we could just pass a pointer via the numpy array API it would basically be a no-op. We could internalize the GPU representation in the C++ code if we really need to flatten the arrays on transfer to the GPU, via the strides or something.
  • I'm sure there's more; I left comments strewn throughout the code where I got stuck on something.

Use it; I'm sure there'll be more ideas about what we can fix, add, or change. I'm still trying to work out how best to integrate it into the repo.
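
The zero-copy idea in the ndarray bullet above can be illustrated in pure numpy (the pybind11/C++ side is not shown, and nothing here is the actual KBMOD code, just a sketch of the view-vs-copy distinction the bullet relies on):

```python
import numpy as np

# On a C-contiguous array, `ravel` returns a view (no copy), while
# `flatten` always copies. A binding that accepts the underlying buffer
# would see the same memory the Python side holds - the "no-op" transfer.
img = np.arange(12, dtype=np.float32).reshape(3, 4)

flat_view = img.ravel()    # shares memory with `img`
flat_copy = img.flatten()  # independent copy

assert np.shares_memory(img, flat_view)
assert not np.shares_memory(img, flat_copy)

# Mutating the view is visible in the original; mutating the copy is not.
flat_view[0] = -1.0
assert img[0, 0] == -1.0
```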

@DinoBektesevic
Member Author

I came up with a couple more want-to's and good-to-do's:

  • Serializing ImageStack
  • Serializing LayeredImage
  • Serializing RawImage
  • Potentially round-tripping ImageCollection to ImageStack and back
  • Look into adopting lsst.resource as our internal handler of paths, URLs and URIs
  • Delete the current C++ load_file stuff
  • Delete the run_search class
  • Delete the Interface class
  • Can we come up with a more Pythonic convention for naming and binding the C++ classes?
  • Clean up some of the method names and attributes in C++ (PPI is n_pixels in image, not pixels per inch, for example)

These would all be good ways to create export functionality for ImageStack, alleviating the need to constantly create and re-create the ImageStack required for processing with KBMOD. It would be a valid solution to the issue of ImageStack creation from various data sources being slow, and it would also make staging the data to process much easier in batch-like processing environments. If we could make it round-trip too, then we could carry all the metadata along and make ImageCollection and ImageStack more equivalent in content, even when not in functionality. These files might be large, but they would still be smaller than transferring 100 DECam Community Pipelines processed FITS files, with all their CCDs in them, just to be able to create the ImageStack on the processing resource for 1 targeted CCD.

The exported FITS file could have a primary header, or a BinTable HDU, containing the internal metadata and indices, so that it carries the metadata table with it. The challenge would be associating mask, variance and science exposure extensions with each other, which would need a mechanism (a lookup table of names or extension indices) in the file. This could be based on the existing table-row-to-serializer-extension indices, but it could also be unrelated to them.
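
The lookup-table mechanism could look something like the following sketch. Everything here is hypothetical (the HDU layout, the function name, the triplet ordering), not the actual KBMOD serialization format:

```python
def build_ext_lookup(n_images):
    """Map a metadata-table row to the extension indices of its planes.

    Assumes (illustratively) that HDU 0 is the primary header, HDU 1 the
    metadata BinTable, and image extensions follow in
    (science, mask, variance) order for each image.
    """
    lookup = {}
    first_img_ext = 2
    for row in range(n_images):
        base = first_img_ext + 3 * row
        lookup[row] = {
            "science": base,
            "mask": base + 1,
            "variance": base + 2,
        }
    return lookup

lookup = build_ext_lookup(3)
# e.g. the second image's variance plane lives in HDU 7
assert lookup[1]["variance"] == 7
```

Storing such a table in the primary header (or alongside the BinTable) would let a reader re-associate the planes without guessing from extension names.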

@DinoBektesevic
Member Author

  • Time the masking code and make sure we haven't lost too much by moving it out of C++. Try to optimize, or just move it back to C++ if it's that bad.

# solution and a flat lookup table.
# If they weren't added when standardizers were unravelled it's
# basically not possible to reconstruct them. Guess attempt is no good?
no_std_map = False
Contributor


What surveys have this type of nested data? I remember discussing this in terms of the type of input file, but can't remember where we'd expect to see those input files.

Edit: I see you mention DECam below. Were the previous files KBMOD used flattened as part of preprocessing?

Member Author


Almost all of them are like this. DECam processed via the Science Pipelines will have 1 FITS file per CCD. DECam processed via Community pipelines will have 1 FITS file per 62-72 CCDs depending on the data product in question. Raws will have 72 (focus and tracking CCDs) HDUs, calibrated exposures will have 62 HDUs (just the science CCDs).

This doesn't mean that Rubin Sci. Pipes. files have only 1 HDU. They have ~16 HDUs per FITS file processed to a calexp and ~30 for coadds; it's just that the other HDUs hold data like the PSF, mask, variance, etc., whereas for Community Pipelines products these are separate data products (different FITS files, for example).

More details are given in the following comment and the design document as well.
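
A standardizer has to tell these layouts apart, roughly as in this sketch (the function, HDU names, and auxiliary-name set are illustrative, not the real KBMOD logic):

```python
def science_hdus(hdu_names):
    """Return indices of HDUs holding standalone science images.

    Illustrative rule: for Community-Pipelines-style products every
    image extension is a CCD; for Rubin-Science-Pipelines-style calexps
    only the image plane is science, with mask/variance/PSF etc. stored
    as auxiliary extensions of the same file.
    """
    aux = {"MASK", "VARIANCE", "PSF", "WCS"}  # assumed auxiliary names
    return [i for i, name in enumerate(hdu_names) if name not in aux]

# Rubin-style calexp: one science plane plus auxiliary planes
assert science_hdus(["IMAGE", "MASK", "VARIANCE"]) == [0]
# Community-Pipelines-style product: every extension is a CCD
assert science_hdus(["S1", "S2", "S3"]) == [0, 1, 2]
```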

raise NotImplementedError()

# no idea really what to do about this one? AFAIK almost no images come
# with PSFs in them
Contributor


What is Rubin's plan for providing PSFs? Is it a separate query?

Member Author


It's in the FITS file (that is, it's part of the object you can retrieve via the Butler), but it's their own internal representation of a PSF, so this isn't something we can rely on as a generic thing. We can make the ButlerStandardizer use it to load up the object in Python, but then we need to do something with that object, like evaluate a realization of the PSF at some wavelength and position on the CCD as a numpy array. Even then we need to line up that realization with the pixel size, so that the realization covers the same pixels as our images, i.e. so their physical sizes are comparable, because that's how the current PSF class works: the kernel dimension is directly related to the pixel size. For other instruments I've no idea what to do, tbh.

Part of the problem is that the PSF class we have is not a good representation of a PSF. Not because Gaussians are a bad functional representation of a PSF, but because the class only stores the PSF as an array, not as some functional expression.
And even then, as far as I have seen, there is no way the Rubin PSF class can be "exported" to any of the functional PSF forms from Astropy (e.g. Moffat) or photutils. That wouldn't help us much anyway, because we would still need a C++ representation of the same thing if we want to evaluate it at points (in a way generic enough to support things like PSFs that vary across the CCD) and so on. I guess we can evaluate it and then fit it, but the costs involved here are kinda crazy.

# Rubin Sci. Pipes. return their own internal SkyWcs object. We mock a
# Header that'll work with ButlerStd instead. It works because in the
# STD we cast SkyWcs to dict-like thing, from which we make a WCS. What
# happens if SkyWcs changes though?
Contributor


We could add a test that creates a SkyWcs and makes sure it can convert to a dict-like thing? Or, more realistically, we can fix it if it changes later.

Member Author

@DinoBektesevic DinoBektesevic Jan 23, 2024


SkyWcs comes from lsst.afw; if I mock it, I can mock it in a way that it dictifies, but that still doesn't tell me anything about the real SkyWcs from the AFW C++ code used by Rubin. Thankfully this is not likely to change, but generally doing these tests is a pain. We need a canary CI that builds from Rubin weeklies, but the setup is hard, so it's a big job, and the CI run itself would probably go on for a few hours: it would need to build the weekly, run some data through it, and then run the KBMOD tests.
This has been a long-time goal of mine, though.

I'm not sure how to create a SkyWcs here, could you elaborate?

Contributor


I'm not sure how to create a SkyWcs either. I was just suggesting that, if we could create one, we could use it to watch for breakages.
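
The cast-to-dict pattern under discussion can be sketched with a mock in place of the real SkyWcs (every name below is hypothetical; the real lsst.afw API is not shown, which is exactly why a mock cannot catch upstream changes):

```python
class MockSkyWcs:
    """Stand-in for a Rubin SkyWcs - illustrative, not the real API."""

    def __init__(self, cards):
        self._cards = cards

    def to_dict(self):
        # the "dictify" step the standardizer relies on
        return dict(self._cards)

def header_from_dictlike(wcs_obj):
    """Build a plain FITS-header-style dict from any dict-convertible WCS.

    In the real code an astropy.wcs.WCS would be constructed from this
    mapping; here we stop at the dict to keep the sketch self-contained.
    """
    return wcs_obj.to_dict()

mock = MockSkyWcs({"CTYPE1": "RA---TAN", "CTYPE2": "DEC--TAN", "CRVAL1": 10.0})
hdr = header_from_dictlike(mock)
assert hdr["CTYPE1"] == "RA---TAN"
```

The mock pins down the interface the standardizer expects; only a canary build against real Rubin weeklies can verify the real SkyWcs still satisfies it.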

@DinoBektesevic
Member Author

Blocked by #440
Do not merge without rebasing after #440 merges.

performance penalty.

_isImageLike loads every image from the disk, and then promptly
forgets them. This is super costly.
Rejigger the whole canStandardize pipeline to make it faster.

Clean up some docs; remove the _isMultiExt method - needless overhead.
This required changing how Standardizer.get works, which also fixed how
`forceStandardizer` works and how "unravelling" of metadata from
Standardizers works.

It provides even more impetus to record some metadata into the
.meta attribute; so that the full butler could be reconstructed
from it. This requires some thinking - but it could be handled
by silently ignoring **kwargs from meta that are not explicitly
defined in a particular Standardizer's __init__.
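
The "silently ignore **kwargs from meta that are not explicitly defined in a particular Standardizer's __init__" idea can be sketched with inspect (the filter function and the example Standardizer are illustrative, not the real KBMOD code):

```python
import inspect

def accepted_kwargs(cls, meta):
    """Filter a metadata dict down to the parameters cls.__init__ accepts."""
    params = inspect.signature(cls.__init__).parameters
    return {k: v for k, v in meta.items() if k in params}

class FitsStandardizer:  # hypothetical Standardizer
    def __init__(self, location, hdulist=None):
        self.location = location
        self.hdulist = hdulist

# .meta carries extra keys (detector, visit) this class doesn't declare;
# they are dropped instead of raising a TypeError on construction.
meta = {"location": "img.fits", "detector": 35, "visit": 12345}
kwargs = accepted_kwargs(FitsStandardizer, meta)
assert kwargs == {"location": "img.fits"}
std = FitsStandardizer(**kwargs)
```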

This was significantly harder than anticipated and very
unintuitive.

Reworked the way bitmasks work - now via the Astropy bitmask module.
Implement the methods required to run KBMOD on ImageCollection.
Cleanup ImageCollection behaviour:
*) return image collection when indexed by lists, arrays and slices
*) return Row when indexed by integer
*) return Table when sliced by columns
*) Rename the exts to processable. Alias it to a property so that
   each Standardizer can implement its own internal structure the
   way it wants (but also because I was too lazy to rename everything)
*) Fix documentation
*) Move WCS and BBOX as properties to a Standardizer - if that's
   where we need to explain why they are special that's where they
   need to live. Make them an abstractproperty and demand that the
   Standardizers return a list of None's if need be.
*) Fix forceStandardizer keyword (again).
*) Add toLayeredImage as an abstract method to the Standardizers
   Implement them for the three example Standardizers we have.
*) Add toImageStack as an abstract method to the standardizers.
   Implement them in ImageCollection
*) Add run method prototype to ImageCollection to showcase how
   we can neatly integrate with the ImageCollection to execute
   KBMOD runs.

Write an example python script showcasing most of this functionality.

TODO: tests, unit tests, integration tests - all the tests.
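
The ImageCollection indexing conventions listed above (integer gives a Row, list/array/slice gives a collection) can be sketched with a minimal stand-in class; nothing here is the real ImageCollection:

```python
class MiniCollection:
    """Toy stand-in illustrating the indexing conventions only."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self.rows[key]                           # a single Row
        if isinstance(key, slice):
            return MiniCollection(self.rows[key])           # a collection
        return MiniCollection([self.rows[i] for i in key])  # list/array

ic = MiniCollection([{"mjd": 1.0}, {"mjd": 2.0}, {"mjd": 3.0}])
assert ic[0] == {"mjd": 1.0}                      # int -> Row
assert isinstance(ic[1:], MiniCollection)         # slice -> collection
assert [r["mjd"] for r in ic[[0, 2]].rows] == [1.0, 3.0]  # list -> collection
```

Column slicing returning a Table is omitted since the toy rows are plain dicts.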
This is perhaps not the best solution, but the old solution
led to a proliferation of factory functions that basically
interpreted the different input data types and then constructed
standardizers from them. This floated up to ImageCollection, which
then had about a million methods: fromHDUL, fromHDULs, fromPath, fromPaths, etc.
Refactor the mock_fits util a bit.
Cleanup affected code.
Fix the wrong resolveTarget for ButlerStandardizer.
Update the init methods for fits files to specialize for hdulist
    and location instead of a target. This is required to make roundtripping
    of ImageCollection work, but should be updated in the future.
@DinoBektesevic DinoBektesevic merged commit c4dfc13 into main Jan 29, 2024
2 checks passed