Extend whitening tests #3531

Open · wants to merge 10 commits into main
Conversation

@JoeZiminski (Collaborator) commented on Nov 12, 2024

Further to #3505, this PR extends the whitening tests to check that the values are correct for both int16 and float32 data, as well as checking most of the parameters. Also, some functions are factored out in whiten.py to help with testing.

For the most part, the approach is to generate test data with a known covariance, compute the whitening, and then check that a) the covariance matrix is recovered correctly and b) the whitened data is indeed white.
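As a rough illustration of this approach (a plain-NumPy sketch, not the PR's actual test code):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Test data with a known covariance across three "channels".
cov = np.array([[1.0, 0.5, 0.0],
                [0.5, 1.0, 0.3],
                [0.0, 0.3, 1.0]])
data = rng.multivariate_normal(mean=np.zeros(3), cov=cov, size=100_000)

# a) the covariance matrix is recovered correctly
cov_est = (data.T @ data) / data.shape[0]
assert np.allclose(cov_est, cov, atol=0.05)

# ZCA-style whitening matrix: W = U diag(1 / sqrt(s + eps)) U^T
s, U = np.linalg.eigh(cov_est)
W = U @ np.diag(1.0 / np.sqrt(s + 1e-8)) @ U.T

# b) the whitened data is indeed white (covariance ~ identity)
whitened = data @ W
cov_white = (whitened.T @ whitened) / whitened.shape[0]
assert np.allclose(cov_white, np.eye(3), atol=0.05)
```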

Some points for discussion; @yger, it would be great to get your thoughts:

  1. `apply_mean=False` by default. I don't think (?) there is any reason not to remove the mean prior to computing the covariance, which would reduce bias. Indeed, the whitened data is not really white when `apply_mean=False`, but it is when `apply_mean=True`. Should the default be `True`?

  2. For regularizing, `regularize_kwargs["assume_centered"] = True` is always set in the code, even though the default is `apply_mean=False` as above. I added an error in this case, to force the user to switch to `apply_mean=True`. An error is also raised if the user explicitly sets `regularize_kwargs["assume_centered"] = False`, as this would be overwritten to `True`. However, maybe a better approach is to set `regularize_kwargs["assume_centered"]` directly from `apply_mean` (see the sketch after this list)?

  3. `sklearn.covariance` exposes some estimators as classes and others as functions. At the moment the regularize kwargs can only accept the class-based ones and will crash for the function-based ones. I just added a note to the docstring, but wanted to raise this in case it is important to support both.
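A hypothetical sketch of the alternative raised in point 2 (the helper name and exact checks are illustrative, not the PR's code):

```python
def resolve_regularize_kwargs(apply_mean: bool, regularize_kwargs: dict | None) -> dict:
    """Hypothetical helper: derive sklearn's `assume_centered` from `apply_mean`."""
    regularize_kwargs = dict(regularize_kwargs or {})
    if "assume_centered" in regularize_kwargs:
        raise ValueError("`assume_centered` is derived from `apply_mean`; do not set it directly.")
    # If the mean was removed before estimating the covariance (apply_mean=True),
    # the data handed to the sklearn estimator is already centered.
    regularize_kwargs["assume_centered"] = apply_mean
    return regularize_kwargs
```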

Question on CustomRecording

@alejoe91 @samuelgarcia I wanted to make a quick recording object and set some custom data on it for testing purposes. The result is the small CustomRecording (top of test_whiten.py), which basically exposes a set_data function, while get_traces() just returns the data that was set, without any processing (a rough sketch of the idea is below). Is this okay with you both? If so, I will factor it out into a separate PR.
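A minimal, framework-free sketch of the idea (not the actual class, which would subclass the SpikeInterface recording machinery):

```python
import numpy as np

class CustomRecording:
    """Minimal stand-in: data is set directly, get_traces() returns it unchanged."""

    def __init__(self, sampling_frequency: float):
        self.sampling_frequency = sampling_frequency
        self._data = None

    def set_data(self, data: np.ndarray):
        # data: (num_samples, num_channels) array used verbatim as the traces
        self._data = data

    def get_traces(self, start_frame=None, end_frame=None):
        # No scaling, casting, or processing: return exactly what was set
        return self._data[start_frame:end_frame]
```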

```diff
@@ -223,27 +198,88 @@ def compute_whitening_matrix(
         # type and we estimate a more reasonable eps in the case
         # where the data is on a scale less than 1.
         if eps is None:
             eps = 1e-8
             if data.dtype.kind == "f":
```
@JoeZiminski (Collaborator, Author) commented on the diff:

As the `random_chunk` -> `data` is always float, I removed the int16 option; let me know if this makes sense.

Out of interest, what is the purpose of this scaling? Will it have much effect if it is capped in the maximum direction by 1e-16?

@alejoe91 (Member):

@JoeZiminski for the CustomRecording, wouldn't a NumpyRecording work as well? What would be the added value of the CustomRecording?

@JoeZiminski (Collaborator, Author):

Thanks @alejoe91, there is no added value! I just didn't know NumpyRecording existed 😄 thanks I'll use that

@h-mayorquin (Collaborator):

The NoiseGeneratorRecording already has the option of passing the covariance matrix:

```python
class NoiseGeneratorRecording(BaseRecording):
    """
    A lazy recording that generates white noise samples if and only if `get_traces` is called.

    This is done by tiling a small noise chunk. Two strategies are available to be
    reproducible across different start/end frame calls:
      * "tile_pregenerated": pregenerate a small noise block and tile it depending on start_frame/end_frame
      * "on_the_fly": generate small noise chunks on the fly and tile them; the seed also depends on the noise block.

    Parameters
    ----------
    num_channels : int
        The number of channels.
    sampling_frequency : float
        The sampling frequency of the recorder.
    durations : list[float]
        The durations of each segment in seconds. Note that the length of this list is the number of segments.
    noise_levels : float | np.array, default: 1.0
        Std of the white noise (if an array, defined per channel).
    cov_matrix : np.array | None, default: None
        The covariance matrix of the noise.
    dtype : np.dtype | str | None, default: "float32"
        The dtype of the recording. Note that only np.float32 and np.float64 are supported.
    seed : int | None, default: None
        The seed for np.random.default_rng.
    strategy : "tile_pregenerated" | "on_the_fly", default: "tile_pregenerated"
        The strategy for generating noise chunks:
          * "tile_pregenerated": pregenerate a noise chunk of noise_block_size samples and repeat it;
            very fast and consumes only one noise block.
          * "on_the_fly": generate a new noise block on the fly by combining seed + noise block index;
            no memory preallocation but a bit more computation (random).
    noise_block_size : int, default: 30000
        Size in samples of the noise block.

    Notes
```
Wouldn't that eliminate some of your scaffolding? Is there something you are missing there?

Also, maybe the following would be useful:
https://docs.pytest.org/en/stable/reference/reference.html#pytest-importorskip
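For example:

```python
import pytest

# Skip this test module entirely if scikit-learn is not installed,
# e.g. for the regularized-covariance tests.
pytest.importorskip("sklearn")
```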

@JoeZiminski (Collaborator, Author) commented on Nov 13, 2024

Thanks @h-mayorquin! I did look at that, but I think the one thing I need that is not there is the option to cast the data to int16, and to pass some custom means. The latter would be easy to extend; the former too, I think, but would it be useful? If so, I could extend the NoiseRecording to take int16 as a dtype and cast the MVN data using this strategy (unless there is a better one)?

Awesome, thanks for that importorskip pointer, will check it out! 🤤
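A hypothetical sketch of that casting strategy (the scale factor is chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
floats = rng.multivariate_normal(mean=np.zeros(2), cov=np.eye(2), size=30_000)

# Scale the float MVN samples up before casting, so the int16 traces
# keep some resolution, then clip to the representable range.
gain = 1000.0
int16_traces = np.clip(np.round(floats * gain), -32768, 32767).astype(np.int16)
```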

@alejoe91 added the preprocessing (Related to preprocessing module) label on Nov 13, 2024
@alejoe91 (Member):

I think that since you need to craft specific sets of traces, you'd be better off using the NumpyRecording directly :)
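For example, crafting the traces and wrapping them directly (a sketch; NumpyRecording takes a list of per-segment (num_samples, num_channels) arrays):

```python
import numpy as np
from spikeinterface.core import NumpyRecording

rng = np.random.default_rng(seed=0)
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
traces = rng.multivariate_normal(mean=np.zeros(2), cov=cov, size=30_000).astype("float32")

# get_traces() will return exactly these values: no custom class needed.
recording = NumpyRecording(traces_list=[traces], sampling_frequency=30_000.0)
```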

@h-mayorquin (Collaborator):

Yeah, I think @alejoe91's suggestion is the simpler one. I would vote for that unless you need considerably memory-heavy traces, as it would simplify your scaffolding and would not require any other changes.

For the NoiseRecording, you could use the `astype` preprocessing if you need that dtype (the problem is that we use white noise under the hood and that can be generated on the fly; I can give you more details on Slack if you are curious). Extending the NoiseRecording to accept a mean seems like a good thing though, maybe something to do, but it is outside the scope of this PR.
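For example, using the `astype` preprocessing mentioned above (a sketch; any scaling into the int16 range would need to happen beforehand):

```python
from spikeinterface.preprocessing import astype

# Cast a float recording to int16 via the astype preprocessor.
recording_int16 = astype(recording, dtype="int16")
```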

@JoeZiminski (Collaborator, Author):

Great, thanks both! I went for NumpyRecording in the end as suggested; I think this is ready for review.
