Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JP-3456 - Replace use of scipy.signal.medfilt in outlier detection #8033

Merged
merged 5 commits into from
Nov 21, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 9 additions & 1 deletion CHANGES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,15 @@ general

- Add lack of python 3.12 support to project metadata [#8042]

- Increase asdf maximum version to 4 [#8018]
- Increase asdf maximum version to 4. [#8018]

- Remove upper version limit for scipy. [#8033]

outlier_detection
-----------------

- Remove use of ``scipy.signal.medfilt`` which is undefined for ``nan``
inputs. [#8033]

imprint
-------
Expand Down
19 changes: 15 additions & 4 deletions jwst/outlier_detection/outlier_detection_ifu.py
Original file line number Diff line number Diff line change
@@ -1,17 +1,28 @@
"""Class definition for performing outlier detection on IFU data."""

import logging

import numpy as np
from skimage.util import view_as_windows

from stdatamodels.jwst import datamodels
from scipy.signal import medfilt
from .outlier_detection import OutlierDetection
from stdatamodels.jwst.datamodels import dqflags
import logging
from .outlier_detection import OutlierDetection

log = logging.getLogger(__name__)
log.setLevel(logging.DEBUG)

__all__ = ["OutlierDetectionIFU"]


def medfilt(arr, kern_size):
# scipy.signal.medfilt (and many other median filters) have undefined behavior
# for nan inputs. See: https://github.com/scipy/scipy/issues/4800
padded = np.pad(arr, [[k // 2] for k in kern_size])
windows = view_as_windows(padded, kern_size, np.ones(len(kern_size), dtype='int'))
return np.nanmedian(windows, axis=np.arange(-len(kern_size), 0))


class OutlierDetectionIFU(OutlierDetection):
"""Sub-class defined for performing outlier detection on IFU data.

Expand Down Expand Up @@ -220,7 +231,7 @@ def flag_outliers(self, idet, uq_det, ndet_files,
minarr = np.nanmin(diffarr, axis=0)

# Normalise the differences to a local median image to deal with ultra-bright sources
normarr = medfilt(minarr, kernel_size=kern_size)
normarr = medfilt(minarr, kern_size)
nfloor = np.nanmedian(minarr)/3
normarr[normarr < nfloor] = nfloor # Ensure we never divide by a tiny number
minarr_norm = minarr / normarr
Expand Down
42 changes: 42 additions & 0 deletions jwst/outlier_detection/tests/test_medfilt.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,42 @@
import warnings

import pytest
import numpy as np
import scipy.signal

from jwst.outlier_detection.outlier_detection_ifu import medfilt



@pytest.mark.parametrize("shape,kern_size", [
([7, 7], [3, 3]),
([7, 7], [3, 1]),
([7, 7], [1, 3]),
([7, 5], [3, 3]),
([5, 7], [3, 3]),
([42, 42], [7, 7]),
([42, 42], [7, 5]),
([42, 42], [5, 7]),
([42, 7, 5], [3, 3, 3]),
([5, 7, 42], [5, 5, 5]),
])
def test_medfilt_against_scipy(shape, kern_size):
arr = np.arange(np.prod(shape), dtype='uint32').reshape(shape)
result = medfilt(arr, kern_size)
expected = scipy.signal.medfilt(arr, kern_size)
np.testing.assert_allclose(result, expected)


@pytest.mark.parametrize("arr,kern_size,expected", [
([2, np.nan, 0], [3], [1, 1, 0]),
([np.nan, np.nan, np.nan], [3], [0, np.nan, 0]),
])
def test_medfilt_nan(arr, kern_size, expected):
with warnings.catch_warnings():
warnings.filterwarnings(
"ignore",
message="All-NaN slice",
category=RuntimeWarning
)
result = medfilt(arr, kern_size)
np.testing.assert_allclose(result, expected)
5 changes: 3 additions & 2 deletions setup.cfg
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,8 @@ install_requires =
poppy>=1.0.2
pyparsing>=2.2.1
requests>=2.22
scipy>=1.6.0,<1.10.0
scikit-image>=0.19
scipy>=1.6.0
spherical-geometry>=1.2.22
stcal>=1.4.4,<1.5.0
stdatamodels>=1.8.3,<1.9.0
Expand All @@ -53,7 +54,7 @@ install_requires =
tweakwcs>=0.8.3
asdf-astropy>=0.3.0
wiimatch>=0.2.1
packaging>19.0
packaging>20.0
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was required for the minimum version (0.19) of scikit-image.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks ok to me. From our slack messages I assume you have run the outlier detection regression tests and they now show you no difference

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking a look.

I should clarify, this PR does cause 12 regtest failures. The failures are from 2 parameterized tests:

These are the same tests that fail when scipy is updated, or when run on a mac and can be explained by the undefined (and unreliable) output of scipy.signal.medfilt when provided a np.nan input (as occurs in the failing tests). To use the examples in the unit test added with this PR.

arr = [2, np.nan, 0]
ksize = [3]

The expected output using a median filter (and filling edges with 0s as is done by default for scipy.signal.medfilt) should be [1, 1, 0].

If I run the following on my mac, with scipy 1.9.3 I get a different and incorrect output:

>>> scipy.signal.medfilt([2, np.nan, 0], [3])
array([nan, nan,  0.])

with scipy 1.11.3 I get a different (and also incorrect):

>>> scipy.signal.medfilt([2, np.nan, 0], [3])
array([0., 0., 0.])

Other examples show differences for linux vs mac.

As the truth files were generated using scipy.signal.medfilt and the input contains nan the truth file contents depend on the undefined behavior of the version of scipy used to generate the file. The truth files for these tests will need to be updated if the output is acceptable (and this PR is merged).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To further clarify, with this PR the new output (which differs from the truth files) is identical on mac and linux for the _crf and _x1d files. The _s3d files show differences (< 1e-5) on mac and linux (independent of this PR other tests that generate _s3d files show differences between mac and linux on the main branch as seen in the jenkins devdeps tests so this looks to be a separate issue, one not addressed in this PR).

For the _crf files, the differences are only for pixels with a nan within the 7x7 window used for the median filter. These are largely found on the edges of usable pixels. Taking the first example in the diff for file jw01249005001_03101_00001_nrs1_o005_crf.fits. The fitsdiff output reports:

Extension HDU 1 (SCI, 1):
   Data contains differences:
     Data differs at [1716, 94]:
        a> nan
        b> -103.99343
...
   Data contains differences:
     Data differs at [1716, 94]:
        a> 33554449
         ?       ^^
        b> 33554432

indicating that the pixel at (using numpy indices) [93, 1715] is being flagged as an outlier in the result file but not in the truth file. Inspecting the minarr calculated during the outlier detection step:

minarr = np.nanmin(diffarr, axis=0)
# Normalise the differences to a local median image to deal with ultra-bright sources
normarr = medfilt(minarr, kernel_size=kern_size)
nfloor = np.nanmedian(minarr)/3
normarr[normarr < nfloor] = nfloor # Ensure we never divide by a tiny number
minarr_norm = minarr / normarr

>> minarr[93, 1715]
66.545156955719
>> minarr[90:97, 1712:1719]
array([[        nan,         nan,         nan,         nan,         nan,
                nan,         nan],
       [        nan,         nan,         nan,         nan,         nan,
                nan,         nan],
       [        nan,         nan,         nan,         nan,         nan,
                nan,         nan],
       [29.89870262,  4.97431731, 19.80752754, 66.54515696, 27.27770805,
        17.14254284,  2.87409228],
       [ 0.38395762,  2.3855598 ,  4.7994982 ,  8.4161818 ,  1.58876133,
         3.21722126,  1.05671763],
       [ 0.14560556,  1.19486213,  0.89728248,  0.40425563,  1.58876133,
         1.25159645,  0.71704447],
       [ 0.14560556,  1.19486213,  0.89728248,  0.40425563,  2.14912176,
         0.49365234,  0.51363105]])

This pixel is on the edge of a group of usable pixels, inspecting the computed normarr (the output of medfilt) for this PR we see:

>>  normarr[90:97, 1712:1719]
array([[19.80752754, 27.27770805, 27.27770805, 19.80752754, 17.14254284,
        19.80752754, 17.14254284],
       [ 6.30404165,  7.02648562,  6.69524956,  4.88690776,  4.00835973,
         4.62222749,  4.19920981],
       [ 3.89869487,  3.89869487,  2.3855598 ,  2.3855598 ,  2.3855598 ,
         2.78641891,  2.78641891],
       [ 1.19486213,  1.58876133,  1.42017889,  1.42017889,  1.42017889,
         1.58876133,  1.58876133],
       [ 1.0041163 ,  1.58876133,  1.25159645,  1.25159645,  1.25159645,
         1.58876133,  1.51533907],
       [ 0.9249387 ,  1.30883139,  1.22322929,  1.22322929,  1.19486213,
         1.33719856,  1.46906986],
       [ 0.91056663,  1.42280066,  1.19486213,  1.25159645,  1.19486213,
         1.25159645,  1.42280066]])

Which differs substantially from the normarr computed without this PR (using scipy.signal.medfilt):

>> scipy.signal.medfilt(minarr)[90:97, 1712:1719]
array([[        nan,         nan,         nan,         nan,         nan,
                nan,         nan],
       [        nan,         nan,         nan,         nan,         nan,
                nan,         nan],
       [19.80752754, 27.27770805, 27.27770805, 19.80752754,         nan,
        19.80752754, 17.14254284],
       [19.80752754,  8.83398247, 27.27770805, 19.80752754, 17.14254284,
        27.01134586,         nan],
       [ 3.89869487,  2.41227081,  2.41227081,  2.41227081,  2.2597177 ,
         2.78641891,  2.2597177 ],
       [ 1.19486213,  1.62179971,  1.58876133,  1.58876133,  1.58876133,
         1.62179971,  1.62179971],
       [ 0.91056663,  1.42280066,  1.19486213,  1.25159645,  1.19486213,
         1.25159645,  1.42280066]])

Inspecting the value of normarr used in the further outlier detection for the pixel in question gives 2 very different results for this PR vs main:

>> scipy.signal.medfilt(minarr)[93, 1715]
19.807527542114258
>> normarr[93, 1715]
1.4201788902282715

As the normarr values are later used to compute the ratio of the minarr value relative to the local median this pixel with this PR is ~66 / 1.4 and flagged as an outlier vs ~66 / 19.8 and not flagged as an outlier. As this 'flag' is used to mark bad pixels in all 8 input files this explains the 8 _crf test failures.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jemorrison do these changes look reasonable to you?

importlib-metadata>=4.11.4
jsonschema>=4.8

Expand Down