FDataIrregular personal proposed reviews #593
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```diff
@@                    Coverage Diff                            @@
##           feature/irregular_operations     #593      +/-   ##
================================================================
- Coverage                          86.68%   86.65%   -0.03%
================================================================
  Files                                156      156
  Lines                              13326    13314      -12
================================================================
- Hits                               11551    11537      -14
- Misses                              1775     1777       +2
```

☔ View full report in Codecov by Sentry.
I agree, this is better.
I do not think this is necessary. There are three places in which split takes place:
Thank you! I also did not notice that functions without values are not kept (also in the original). I think this would be surprising behavior, as the n-th function after
I am somewhat surprised that appending to a list and computing
Thanks for your remarks. I was not familiar with
I don't have a super strong opinion about this, but I would still vote for these properties. Regarding restrict, you're totally right, I will rewrite it, especially if we agree on your following point.
Hmm... that's a good point. I didn't even question the behaviour since I only meant to clean, but it might indeed be more sensible to keep empty functions. If there is a consensus, I'd gladly incorporate this behaviour.
Pure speculations:
I made another comment on purpose to avoid confusion. I would like to know what the position on this is. Personally, I think that:
Sorry, the comments are starting to pile up, but I discovered this edge case of `reduceat`. For example, as you suggested in your review:

```python
>>> np.add.reduceat(
...     [True, False, True, True, False, True],
...     [0, 2, 2, 3]
... )
array([1, 1, 1, 2])  # Instead of the desired array([1, 0, 1, 2])
```

I don't know if there's any easy fix around this. But since I agree that empty samples should be allowed, I pin this here for when I have more time to read it; I think it addresses this issue.
Right now, with just one use, I would argue that it is not easier.
Then we can reevaluate when we have more use cases for it. It is always easier to add features than to remove them.
But they can simply index the
I would say that caching something in programming is very dangerous (because cached things can easily become desynchronized), and should be used sparingly, only when the alternatives are much worse.
I do not rule out the possibility right now, but I prefer to wait until things are clearer.
I think we should keep them, even if empty.
Well, until we have benchmark tests (#585), I think it is very difficult to justify doing micro-optimizations blindly, so I would keep your code for now.
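Regarding the point above about users splitting the arrays themselves, here is a minimal sketch (toy data and hypothetical variable names, not code from the PR) of how per-sample views can be obtained directly with `np.split(points, start_indices[1:])`, which is what the proposed properties would wrap:

```python
import numpy as np

# 3 samples observed at 6 points in total.
points = np.array([[0.0], [1.0], [2.0], [0.5], [1.5], [3.0]])
values = np.array([[1.0], [2.0], [3.0], [0.1], [0.2], [0.3]])
start_indices = np.array([0, 3, 5])

# Per-sample views, equivalent to what `points_split`/`values_split` would cache.
points_split = np.split(points, start_indices[1:])
values_split = np.split(values, start_indices[1:])
for i, (p, v) in enumerate(zip(points_split, values_split)):
    print(i, p.ravel(), v.ravel())
```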
I agree with both of your suggestions. It was the way I was thinking about
I get this point, it's a good one.
Clearly you're right 😬
I thought so but wasn't entirely sure 👍 Overall I agree, I will remove these properties. If they ever become a bit more useful, we could still make them internal properties.
It is indeed a bit tragic since, as you can see below, `reduceat` is by far the fastest option.

Test ran:
````python
from math import ceil, floor, log10
from tabulate import tabulate
from time import perf_counter
import numpy as np


def method_reduceat(mask, idxs):
    """ reduceat """
    return np.add.reduceat(mask, idxs)


def method_split_iterate(mask, idxs):
    """ split then iterate """
    split = np.split(mask, idxs)[1:]
    return np.array([s.sum() for s in split])


def method_only_iterate(split):
    """ iterate over precomputed split """
    return np.array([s.sum() for s in split])


def time_one(n_points, n_samples):
    """ Times one execution of each in a random order. """
    rand_mask = np.random.random(n_points) > .5
    rand_idxs = np.r_[  # Starts with 0 and strictly increasing
        [0],
        np.random.choice(np.arange(1, n_points), n_samples - 1, replace=False)]
    rand_idxs.sort()
    split = np.split(rand_mask, rand_idxs)[1:]
    fs_args = [
        (method_reduceat, (rand_mask, rand_idxs)),
        (method_split_iterate, (rand_mask, rand_idxs)),
        (method_only_iterate, (split,)),
    ]
    n = len(fs_args)
    perm = np.random.permutation(n)
    res = [None] * n
    ts = np.ndarray(3)
    for i in perm:
        f, args = fs_args[i]
        t = -perf_counter()
        r = f(*args)
        t += perf_counter()
        res[i] = r
        ts[i] = t
        np.testing.assert_array_equal(r, res[perm[0]])
    return ts


# Util from https://github.com/eliegoudout/lasvegas/blob/dev/perf/__init__.py
def format_time(duration: float, num_digits: int = 4) -> str:
    """ Formats a `float` duration in seconds to a human readable `str`.

    A few examples with `num_digits = 4` are given below, showcasing
    some special cases.
    ```
    ╭───────────────┬────────────────┬───────────────────────────────────────╮
    │ Duration      │ Result         │ Comment                               │
    ├───────────────┼────────────────┼───────────────────────────────────────┤
    │ 1.5           │ 1.500 ss       │ Significant 0's added                 │
    │ 0.56789       │ 567.9 ms       │ Last digit is rounded...              │
    │ 0.99995       │ 1.000 ss       │ ...which can lead to precision loss   │
    │ 0.12345       │ 123.4 ms       │ Rounds half to even (python built-in) │
    │ 1234          │ 1234. ss       │ Point is added for constant width     │
    │ 12345         │ 12345 ss       │ One more digit for longer durations   │
    │ 123456        │ AssertionError │ Exceeded max duration                 │
    │ -1            │ AssertionError │ Negative duration                     │
    │ 0             │ 0.000 as       │ Smallest unit for shorter durations   │
    │ 5.67e-20      │ 0.057 as       │ Precision is worse near 0.            │
    ╰───────────────┴────────────────┴───────────────────────────────────────╯
    ```
    Implementation heavily relies on the following facts:
    - Consecutive units have constant ratio of `10 ** 3`,
    - Highest unit is the unit of `duration`'s encoding.

    Arguments:
    ----------
        duration (float): Expressed in seconds, duration to format. Must
            satisfy `0 <= duration < 10 ** (num_digits + 1) - .5`.
        num_digits (int): Number of significant digits to display.
            Larger durations can have one more and shorter durations
            less -- see examples above.

    Returns:
    --------
        (str): Formatted duration -- _e.g._ `'567.9 ms'`.

    Raises:
    -------
        `AssertionError` if either `num_digits < 3` or
        `not 0 <= duration < 10 ** (num_digits + 1) - .5`
    """
    units = ['ss', 'ms', 'us', 'ns', 'ps', 'fs', 'as']
    max_pow = 3 * (len(units) - 1)
    n = num_digits
    assert n >= 3
    assert 0 <= duration < 10 ** (n+1) - .5, "Duration out of bounds."
    # Special case 0
    if duration == 0:
        return f"{0:.{n-1}f} " + units[-1]
    # Retrieve left shift for significant part
    left_shift = ceil(- log10(duration)) + n - 1
    significant = round(duration * 10 ** left_shift)
    # Special case `0.0099996` -> `'10.00ms'`
    if significant == 10 ** n:
        significant //= 10
        left_shift -= 1
    # If `duration` is barely too big: remove floating point
    if left_shift == -1:
        return f"{round(duration)} " + units[0]
    # Nominal case
    elif left_shift < max_pow + n:
        unit_index = max(0, 1 + (left_shift - n) // 3)
        y = significant * 10 ** (3 * unit_index - left_shift)
        n_left = int(log10(y) + 1)
        unit = units[unit_index]
        return f"{y:.{max(0, n-n_left)}f}{'.' if n == n_left else ''} " + unit
    # If so small that smallest unit loses precision
    else:
        return f"{duration * 10 ** max_pow:.{n-1}f} " + units[-1]


def expe(n_points, n_samples, n_runs):
    """ Runs multiple times and shows results. """
    T = np.r_[[time_one(n_points, n_samples) for _ in range(n_runs)]]
    T.sort(0)
    min_ = T.min(0)
    max_ = T.max(0)
    mean = T.mean(0)
    p = [5, 25, 50, 75, 95]
    std_p = T[ceil(len(T)*p[1]/100):floor(len(T)*p[-1]/100)+1].std(0)
    percentiles = np.percentile(T, p, 0)
    res = np.c_[min_, max_, mean, std_p, percentiles.T]
    names = ["reduceat", "split + iterate", "iterate only (pre-split)"]
    headers = ['name', 'Min', 'Max', 'Mean', f'Std {p[0]}-{p[-1]}', *map(lambda pp: f"{pp}%", p)]
    table = [[name] + list(map(format_time, line))
             for name, line in zip(names, res)]
    colalign = ('left',) + ('center',) * (res.shape[1])
    print(tabulate(table, headers=headers, tablefmt="rounded_outline", colalign=colalign))


n_points = 10000
n_samples = 1000
n_runs = 1000
expe(n_points, n_samples, n_runs)
````

Results:

```
>>> # Compute new `start_indices` given mask
>>> n_points = 10000
>>> n_samples = 1000
>>> n_runs = 1000
>>>
>>> expe(n_points, n_samples, n_runs)
╭──────────────────────────┬──────────┬──────────┬──────────┬────────────┬──────────┬──────────┬──────────┬──────────┬──────────╮
│ name                     │   Min    │   Max    │   Mean   │  Std 5-95  │    5%    │   25%    │   50%    │   75%    │   95%    │
├──────────────────────────┼──────────┼──────────┼──────────┼────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ reduceat                 │ 34.72 us │ 3.478 ms │ 46.86 us │  6.896 us  │ 35.19 us │ 36.67 us │ 37.98 us │ 44.99 us │ 69.08 us │
│ split + iterate          │ 3.950 ms │ 15.35 ms │ 4.543 ms │  425.9 us  │ 4.028 ms │ 4.113 ms │ 4.173 ms │ 4.620 ms │ 6.090 ms │
│ iterate only (pre-split) │ 2.155 ms │ 10.45 ms │ 2.537 ms │  255.1 us  │ 2.203 ms │ 2.276 ms │ 2.317 ms │ 2.507 ms │ 3.335 ms │
╰──────────────────────────┴──────────┴──────────┴──────────┴────────────┴──────────┴──────────┴──────────┴──────────┴──────────╯
```
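For context, here is a minimal sketch (hypothetical names, assuming strictly increasing `start_indices`; the empty-sample edge case is discussed elsewhere in this thread) of how the per-sample counts computed above would translate into new `start_indices` after masking points:

```python
import numpy as np

mask = np.array([True, False, True, True, False, True])  # points kept after restriction
start_indices = np.array([0, 2, 3])                       # 3 samples, strictly increasing

counts = np.add.reduceat(mask, start_indices)             # points kept per sample
new_start_indices = np.r_[0, np.cumsum(counts)[:-1]]
print(counts)             # [1 1 2]
print(new_start_indices)  # [0 1 2]
```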
Regarding empty samples:
I was implementing this, but we should decide what happens to
Any good idea?
For me
I would like to allow it if possible.
This is ugly. I prefer
Okay, I come with a proposition for the `reduceat` edge cases. Anyway, here's a proposition:

```python
import numpy as np
from numpy.typing import ArrayLike


def _reduceat(
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    ufunc,
    value_empty,
):
    """Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the ones described in the doc of
    `np.ufunc.reduceat`. The different behaviours are the following:

    - No exception is raised when `indices[i] < 0` or
      `indices[i] >= len(array)`. Instead, the corresponding value
      is `value_empty`.
    - When not in the previous case, the result is `value_empty` if
      `indices[i] >= indices[i+1]`, and otherwise the same as
      `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    array, indices = map(np.asarray, [array, indices])
    axis %= array.ndim
    ax_idx = (slice(None),) * axis
    n = array.shape[axis]
    pad_width = np.full((array.ndim, 2), 0)
    pad_width[axis, 1] = 1
    extended_array = np.pad(array, pad_width, mode="empty")
    extended_indices = np.append(indices, n)
    bad = (indices < 0) | (indices >= n)
    empty = (np.diff(extended_indices) <= 0) | bad
    extended_indices[:-1][bad] = n
    out = ufunc.reduceat(
        extended_array, extended_indices, axis=axis, dtype=dtype, out=out
    )[ax_idx + (slice(-1),)]
    out[ax_idx + (empty,)] = value_empty
    return out
```

which is used like so:

```python
>>> array = [[0, 1, 2],
...          [0, 2, 1],
...          [1, 0, 2],
...          [1, 2, 0],
...          [2, 0, 1],
...          [2, 1, 0]]
>>> indices = [0, 0, 100, 2, 2, -1, 2, 5, 2]
>>>
>>> _reduceat(array, indices, dtype=float, ufunc=np.minimum, value_empty=np.nan, axis=0)
array([[nan, nan, nan],
       [ 0.,  0.,  0.],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [ 1.,  0.,  0.],
       [nan, nan, nan],
       [ 1.,  0.,  0.]])
>>> _reduceat(array, indices, dtype=float, ufunc=np.minimum, value_empty=np.nan, axis=-1)
array([[nan,  0., nan, nan, nan, nan,  2., nan,  2.],
       [nan,  0., nan, nan, nan, nan,  1., nan,  1.],
       [nan,  0., nan, nan, nan, nan,  2., nan,  2.],
       [nan,  0., nan, nan, nan, nan,  0., nan,  0.],
       [nan,  0., nan, nan, nan, nan,  1., nan,  1.],
       [nan,  0., nan, nan, nan, nan,  0., nan,  0.]])
```

I essentially had 2 ideas for implementation: either extending the `array` (as done above with `np.pad`), or reducing each slice in a Python loop (see the second implementation further below). I will next add this to the PR. Any comment is appreciated.

P.S.: I should point out that
Also, the following is a problem (from the current validation):

```python
if self.start_indices[-1] >= len(self.points):
    raise ValueError("Index in start_indices out of bounds")
```

How then to instantiate a `FDataIrregular` whose last sample is empty?
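For illustration, a minimal sketch (hypothetical helper name, assuming the relaxed rule discussed in this PR: non-decreasing indices, empty samples allowed, and `start_index = len(points)` permitted) of what the validation could look like:

```python
import numpy as np


def _check_start_indices(start_indices, points):
    """Hypothetical relaxed validation allowing trailing empty samples."""
    start_indices = np.asarray(start_indices)
    if np.any(np.diff(start_indices) < 0):
        raise ValueError("start_indices must be non-decreasing")
    if start_indices[0] < 0 or start_indices[-1] > len(points):
        raise ValueError("Index in start_indices out of bounds")


points = np.zeros((5, 1))
_check_start_indices([0, 3, 5], points)  # last sample empty: accepted
```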
Just for info, I played with a second implementation of `_reduceat` (below). Depending on the shape of the data, I can get from 10x faster to 10x slower... So I wouldn't know which to prioritize at the moment without further digging.

```python
import itertools

import numpy as np
from numpy.typing import ArrayLike, NDArray


def _reduceat2(
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    ufunc,
    value_empty,
) -> NDArray:
    """Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the ones described in the doc of
    `np.ufunc.reduceat`. The different behaviours are the following:

    - No exception is raised when `indices[i] < 0` or
      `indices[i] >= len(array)`. Instead, the corresponding value
      is `value_empty`.
    - When not in the previous case, the result is `value_empty` if
      `indices[i] >= indices[i+1]`, and otherwise the same as
      `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    if not isinstance(axis, int):
        raise NotImplementedError
    array, indices = map(np.asarray, [array, indices])
    ndim = array.ndim
    assert -ndim <= axis < ndim
    axis %= ndim
    pre, (n,), post = map(tuple, np.split(array.shape, [axis, axis + 1]))
    shape = pre + (len(indices),) + post
    if dtype is None:
        dtype = array.dtype
    if out is None:
        out = np.empty(shape, dtype=dtype)
    else:
        out = out.astype(dtype)
    ii = [slice(None)] * ndim
    for i, (a, b) in enumerate(itertools.pairwise(np.append(indices, n))):
        ii[axis] = i
        ii_out = tuple(ii)
        if a < 0 or a >= min(b, n):  # Nothing to reduce
            out[ii_out] = value_empty
        else:
            ii[axis] = slice(a, b)
            ii_array = tuple(ii)
            out[ii_out] = ufunc.reduce(array[ii_array], axis=axis)
    return out
```
I wonder how that compares with my own try (that only attempts to work when the indices are non-decreasing):

```python
import numpy as np
from numpy.typing import ArrayLike


def _reduceat_vnmabus(
    ufunc,
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    value_empty,
):
    """
    Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the ones described in the doc of
    `np.ufunc.reduceat`. The different behaviours are the following:

    - No exception is raised when `indices[i] < 0` or
      `indices[i] >= len(array)`. Instead, the corresponding value
      is `value_empty`.
    - When not in the previous case, the result is `value_empty` if
      `indices[i] >= indices[i+1]`, and otherwise the same as
      `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    array = np.asarray(array)
    indices = np.asarray(indices)

    n = array.shape[axis]
    good_axis_idx = (indices >= 0) & (indices < n) & (np.diff(indices, append=n) > 0)

    n_out = len(indices)
    out_shape = list(array.shape)
    out_shape[axis] = n_out

    out = np.full_like(array, value_empty, shape=out_shape)

    good_idx = [slice(None)] * array.ndim
    good_idx[axis] = good_axis_idx
    good_idx = tuple(good_idx)

    reduce_at_out = ufunc.reduceat(
        array,
        indices[good_axis_idx],
        axis=axis,
        dtype=dtype,
    )

    out[good_idx] = reduce_at_out
    return out
```

A few (small) comments:
I agree.
I do not want to overcomplicate things. I would rather have just one version, the one that performs better with lots of data (when performance really matters). This is likely version 1 or my own implementation (I did not benchmark it but should perform similarly). Version 2 has a Python loop over the functions, which will be noticeable when the number of functions is high.
I think that you are probably right here. It even generalizes better to other data types when we want to support them in the future, e.g. using the max and min representable integers.
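A minimal sketch of that idea (a hypothetical helper, not part of this PR), picking a dtype-dependent "empty" sentinel via `np.iinfo`:

```python
import numpy as np


def _empty_value_for(dtype):
    """Hypothetical sentinel 'empty' value depending on dtype."""
    dtype = np.dtype(dtype)
    if np.issubdtype(dtype, np.floating):
        return np.nan
    if np.issubdtype(dtype, np.integer):
        # For a `min`-like reduction the "empty" value would be the maximum
        # representable integer (and conversely for `max`).
        return np.iinfo(dtype).max
    raise NotImplementedError(dtype)


print(_empty_value_for(np.float64))  # nan
print(_empty_value_for(np.int32))    # 2147483647
```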
I agree with you that in practice the mean and variance computed this way would be empty, and it indeed concerns me too. However, I would argue that finding a common grid and interpolating is not something that should be automated in mean and var, unless there is a unique or best way to do it. Otherwise, we prefer the package to be "dumb" and force the user to make a conscious choice. We have implemented a behavior similar to the one found in Tidyfun (which may not be a sensible choice).

Our statistician has told us that usually in FDA the data are not kept in an irregular representation for long: they are converted to a basis representation as soon as possible, and the computations (such as mean and variance) are performed in that representation. Thus, these methods will probably not be used much. We currently have a member of the team researching the best way to implement the conversion to a basis representation described in https://academic.oup.com/edited-volume/42134/chapter-abstract/356192087 , in which a mixed effects model is used to leverage information from all observations in the functional dataset in order to convert each one (so that the conversion makes sense even for very sparse data), instead of converting each observation individually.
Just a quick partial answer until I can get to a computer.
If you're referring to
I agree, but the question becomes "when is it tangible?", which is deeply related to "how many data points do we expect for each sample in the average case?", which is rather subjective. I will incorporate your proposal and propose a few bench results soon. In the meantime, I think the primary focus, for the package as a whole in relation to #585, should be to identify covering use-case scenarios. What I mean by that is that we have the following (most important) degrees of freedom:
And I think we should partition the possibilities in a meaningful way.
I think this would allow for:
I really think this is one brick of what #585 requires.
Hmmm, then I would ask the following: what is the intended purpose of
Well, I think there is enough functionality that is meaningful for
Sorry for the delay, but I was on vacation + job search so I did not have time. As far as I remember, this is only pending the choice for the
Hi :) No problem, I'm sorry I also got quite busy and left this PR stale! I think that if you want to merge this quickly, then we'd better not think too much about the test scenarios I suggested and simply go with your approach. I bet that handling exclusively non-decreasing arrays makes the implementation simpler anyway.

As for the rest:
If we agree on this, I can do a quick final push with your `_reduceat` version.

Cheers!
Hi,

So I tested it and I undrafted the PR, ready for review/merge @vnmabus :)
Just a couple of minor things before merging.
Automatic casting to float dtype is removed, as it prevents using different float sizes (or even integers in the future).
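An illustrative sketch of the difference (toy arrays, not the actual PR diff):

```python
import numpy as np

values32 = np.array([1.0, 2.0, 3.0], dtype=np.float32)

forced = values32.astype(float)   # automatic cast: always float64
kept = np.asarray(values32)       # no cast: float32 (or integer dtypes) preserved
print(forced.dtype, kept.dtype)   # float64 float32
```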
As we dropped support for Python 3.9, tests now pass. So, I will merge it with the other branch, and if everything goes smoothly, I will merge the other with develop.
Thank you, @eliegoudout, for all the review and programming effort that you have put in here!
Merged commit 21f7bad into GAA-UAM:feature/irregular_operations.
@all-contributors please add @eliegoudout for code, review and ideas.
I've put up a pull request to add @eliegoudout! 🎉
Hello,

I started looking at `FDataIrregular` out of curiosity and I propose a few modifications. I have not reviewed everything at all, but I might look at some more when I have spare time and motivation.

Current modifications:

- Replaced `len(array.shape)` with `array.ndim`,
- Added `FDataIrregular.points_split` and `FDataIrregular.values_split` for clarity and to avoid repeatedly calling `np.split(self.points, self.start_indices[1:])`. As discussed in `FDataIrregular` discussion #592, it might be a good idea to make these `cached_property`'s, but that would require a careful corresponding `deleter` implementation (see the sketch after this description),
- Rewrote `FDataIrregular.restrict` (better vectorization use, deleted loop over dimensions, no set-to-list or list-to-set operation) and added a `with_bounds` option for signature consistency with `FDataGrid`, but didn't implement it. Using `with_bounds=True` will raise a `NotImplementedError`,
- Rewrote `FDataIrregular.concatenate`. It is generally not much faster, but I witnessed up to 40% gain (extreme case with lots of data and pretty high (co)dimensionality).

To do:

- `start_indices` (non-decreasing, allow empty samples and `start_index = len(points)`),
- Fix `reduceat` or find an alternative,
- `_get_sample_range_from_data` + empty samples support (`NaN`),
- `_get_domain_range_from_sample_range` `NaN` support,
- `_sort_by_arguments`,
- `_to_data_matrix`.

Later work:

- `__getitem__` (`[]` error),
- `(np.inf, -np.inf)`.
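As referenced above, a minimal sketch (a toy stand-in class with hypothetical names, not the actual `FDataIrregular` code) of the `cached_property` + careful invalidation idea for `points_split`:

```python
from functools import cached_property

import numpy as np


class _IrregularDataSketch:
    """Toy stand-in for `FDataIrregular`, only to illustrate caching."""

    def __init__(self, points, start_indices):
        self._points = np.asarray(points)
        self._start_indices = np.asarray(start_indices)

    @cached_property
    def points_split(self):
        # Cached so `np.split` is not recomputed on every access.
        return np.split(self._points, self._start_indices[1:])

    def _invalidate_cache(self):
        # The "careful deleter": drop the cached value whenever the
        # underlying arrays change, so it cannot become desynchronized.
        self.__dict__.pop("points_split", None)

    @property
    def start_indices(self):
        return self._start_indices

    @start_indices.setter
    def start_indices(self, value):
        self._start_indices = np.asarray(value)
        self._invalidate_cache()


data = _IrregularDataSketch(np.arange(6.0).reshape(-1, 1), [0, 3, 5])
print([s.ravel() for s in data.points_split])  # [array([0., 1., 2.]), array([3., 4.]), array([5.])]
data.start_indices = [0, 2, 4]
print([s.ravel() for s in data.points_split])  # recomputed after invalidation
```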