`FDataIrregular` personal proposed reviews #593

eliegoudout · 2023-11-14T16:30:38Z

Hello,

I started looking at FDataIrregular out of curiosity and I propose a few modifications. I have not reviewed everything at all, but I might look at some more when I have spare time and motivation.

Current modifications:

Replaced occurrences of len(array.shape) with array.ndim,
Added properties FDataIrregular.points_split and FDataIrregular.values_split for clarity and to avoid repeatedly calling np.split(self.points, self.start_indices[1:]). As discussed in `FDataIrregular` discussion #592, it might be a good idea to make these cached_property's, but would require careful correspondng deleter implementation,
Cleaned FDataIrregular.restrict (better vectorization use, deleted loop over dimensions, no set-to-list or list-to-set operation) and added with_bound option for signature consistency with FDataGrid, but didn't implement it. Using with_bounds=True will raise a NotImplementedError,
Cleaned FDataIrregular.concatenate. It is generally not much faster, but I witnessed up to 40% gain (extreme case with lots of data and pretty high (co)dimensionality).

To do:

Later work:

Fix __getitem__ ([] error),
See below range for empty samples could be (np.inf, -np.inf).

codecov · 2023-11-14T16:36:05Z

Codecov Report

Attention: Patch coverage is 90.58824% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 86.65%. Comparing base (a875bd7) to head (cd7e73e).

Files	Patch %	Lines
skfda/representation/irregular.py	93.50%	5 Missing ⚠️
skfda/representation/basis/_fdatabasis.py	0.00%	2 Missing ⚠️
skfda/typing/_numpy.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@                       Coverage Diff                        @@
##           feature/irregular_operations     #593      +/-   ##
================================================================
- Coverage                         86.68%   86.65%   -0.03%     
================================================================
  Files                               156      156              
  Lines                             13326    13314      -12     
================================================================
- Hits                              11551    11537      -14     
- Misses                             1775     1777       +2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

vnmabus · 2023-11-16T10:17:00Z

* Replaced occurrences of `len(array.shape)` with `array.ndim`,

I agree, this is better.

* Added properties `FDataIrregular.points_split` and `FDataIrregular.values_split` for clarity and to avoid repeatedly calling `np.split(self.points, self.start_indices[1:])`. As discussed in [`FDataIrregular` discussion #592](https://github.com/GAA-UAM/scikit-fda/discussions/592), it might be a good idea to make these `cached_property`'s, but would require careful correspondng `deleter` implementation,

I do not think this is necessary. There are three places in which split takes place:

_get_sample_range_from_data, where the minimum and maximum per function are computed. I think we can compute them without splitting, using np.minimum.reduceat and np.maximum.reduceat.
_sort_by_arguments, where the arguments per function are sorted. I think here there is no alternative, but this is done only once on object creation.
restrict. I do not think that it is necessary here, as you can compute the points to remove independently of the function where they belong, and then use add.reduceat to compute the new indices.

* Cleaned `FDataIrregular.restrict` (better vectorization use, deleted loop over dimensions, no set-to-list or list-to-set operation) and added `with_bound` option for signature consistency with `FDataGrid`, but didn't implement it. Using `with_bounds=True` will raise a `NotImplementedError`,

Thank you! I also did not notice that functions without values are not kept (also in the original). I think this would be surprising behavior, as the n-th function after restrict would not be the same. I would kept them, even if they have no points.

* Cleaned `FDataIrregular.concatenate`. It is generally not much faster, but I witnessed up to 40% gain (extreme case with lots of data and pretty high (co)dimensionality).

I am somehow surprised that appending to a list and compute np.concatenate is faster than allocating all the memory beforehand and assigning ranges. Any idea why? Maybe it is because the original code looped several times?

eliegoudout · 2023-11-17T20:28:28Z

I do not think [points_splitand values_split properties are] necessary. There are three places in which split takes place:

* `_get_sample_range_from_data`, where the minimum and maximum per function are computed. I think we can compute them without splitting, using `np.minimum.reduceat` and `np.maximum.reduceat`.

* `_sort_by_arguments`, where the arguments per function are sorted. I think here there is no alternative, but this is done only once on object creation.

* `restrict`. I do not think that it is necessary here, as you can compute the points to remove independently of the function where they belong, and then use `add.reduceat` to compute the new indices.

Thanks for you're remarks. I was not familier with reduceat and indeed it might simplify some things. However, I would still argue that these properties do have proper interest:

For maintainability, much easier to read,
Lots of functionalities still remain to be implemented, so these might to come in handy,
Simply for the end user, who might not want to have to think "how can I get the individual samples?", especially if they are not very familiar with the library, or with numpy in general,
If we end up making them cached_propertys, there might also be some speed up (though probably negligible most of the time).

I don't have a super strong opinion about this, but I would still vote for these properties.

Regarding restrict, you're totally right, I will rewrite it, especially if we agree on your following point.

[F]unctions without values are not kept [..]. I think this would be surprising behavior, as the n-th function after restrict would not be the same. I would kept them, even if they have no points.

Hmm... that's a good point. I didn't even question the behaviour since I only meant to clean, but it might indeed be more sensible to keep empty functions. If there is a consensus, I'd gladly incorporate this behaviour.

Any idea why? Maybe it is because the original code looped several times?

Pure speculations:

Multiple loops indeed,
Also, if the total array is big, maybe it cannot all go into fast memory, making it slower to fill little by little.

eliegoudout · 2023-11-17T20:43:52Z

I made another comment on purpose to avoid confusion. I would like to know what the position on start_indices is. Currently, nothing fundamentaly prohibits start_indices=[2,1,4] (decreasing and doesn't start with 0) for example. It does bug atm, but what should the data structure allow?

Personally, I think that:

start_indices should be non-decreasing (but may not be increasing),
start_indices[0] = 0 should be forced.

eliegoudout · 2023-11-18T00:07:15Z

Sorry the comments are starting to pile up, but I discovered this edge case of ufunc.reduceat:

Which means that an "empty sample" (indicated by start_indices = [0, 2, 2, 3] for example for the second sample) would not behave as expected when using reduceat if we're not careful enough.

For example, as you suggested for the review of restrict, if we tried to use add.reduceat to compute the new indices, we could face this issue:

>>> np.add.reduceat(
...  [True, False, True, True, False, True],
...  [0, 2, 2, 3]
... )
array([1, 1, 1, 2])  # Instead of desired array([1, 0, 1, 2)

I don't know if there's any easy fix around this. But since I agree that empty samples should be allowed, reduceat might be limited for our use.

I pin this here for when I have more time to read it, I think it addresses this issue.

vnmabus · 2023-11-18T08:49:39Z

* For maintainability, much easier to read,

Right now, with just one use, I would argue that it is not easier.

* Lots of functionalities still remain to be implemented, so these might to come in handy,

Then we can reevaluate when we have more use cases for it. It is always easier to add features than to remove them.

* Simply for the end user, who might not want to have to think "how can I get the individual samples?", especially if they are not very familiar with the library, or with numpy in general,

But they can simply index the FDataIrregular, if what they want is individual samples, don't they?

* If we end up making them `cached_property`s, there might also be some speed up (though probably negligible most of the time).

I would say that caching something in programming is very dangerous (because cached things can easily become desynchronized), and should be used sparingly, only when the alternatives are much worse.

I don't have a super strong opinion about this, but I would still vote for these properties.

I do not discard the possibility right now, but I prefer to wait until the things are more clear.

Hmm... that's a good point. I didn't even question the behaviour since I only meant to clean, but it might indeed be more sensible to keep empty functions. If there is a consensus, I'd gladly incorporate this behaviour.

I think we should keep them, even if empty.

Pure speculations:

* Multiple loops indeed,

* Also, if the total array is big, maybe it cannot all go into fast memory, making it slower to fill little by little.

Well, until we do not have benchmark tests (#585), I think it is very difficult to justify doing micro-optimizations blindly, so I would keep your code for now.

vnmabus · 2023-11-18T08:51:05Z

Personally, I think that:

* `start_indices` should be non-decreasing (but may not be increasing),

* `start_indices[0] = 0` should be forced.

I agree with both of your suggestions. It was the way I was thinking about start_indices, even if that is not documented anywhere.

vnmabus · 2023-11-18T09:03:11Z

Sorry the comments are starting to pile up, but I discovered this edge case of ufunc.reduceat: Which means that an "empty sample" (indicated by start_indices = [0, 2, 2, 3] for example for the second sample) would not behave as expected when using reduceat if we're not careful enough.

Wow, thank you for noticing this! This is a bit tragic, indeed. I think we can have a workaround for it (by computing the indexes with empty functions and overwriting them), but it will be a bit slower. We should probably encapsulate that into a function to prevent mistakes.

eliegoudout · 2023-11-19T21:45:28Z

Then we can reevaluate when we have more use cases for it. It is always easier to add features than to remove them.

I get this point, it's a good one.

But they can simply index the FDataIrregular, if what they want is individual samples, don't they?

Clearly you're right 😬

I would say that caching something in programming is very dangerous (because cached things can easily become desynchronized), and should be used sparingly, only when the alternatives are much worse.

I thought so but wasn't entirely sure 👍

Overall I agree, I will remove these properties. If they ever become a bit more useful, we could still make them internal properties _points_split and _values_split.

eliegoudout · 2023-11-19T22:05:23Z

Wow, thank you for noticing this! This is a bit tragic, indeed. I think we can have a workaround for it (by computing the indexes with empty functions and overwriting them), but it will be a bit slower. We should probably encapsulate that into a function to prevent mistakes.

It is indeed a bit tragic since, as you can see below, reduceat is blazingly faster! I tested with this (ugly in the form but standard in protocol) piece of code, on non corner cases.

Test ran

from math import ceil, floor, log10
from tabulate import tabulate
from time import perf_counter
import numpy as np


def method_reduceat(mask, idxs):
    """ reduceat """
    return np.add.reduceat(mask, idxs)

def method_split_iterate(mask, idxs):
    """ split then iterate """
    split = np.split(mask, idxs)[1:]
    return np.array([s.sum() for s in split])

def method_only_iterate(split):
    """ iterate over precomputed split """
    return np.array([s.sum() for s in split])

def time_one(n_points, n_samples):
    """ Times on execution of each in a random order. """
    rand_mask = np.random.random(n_points) > .5
    rand_idxs = np.r_[  # Starts with 0 and strictly increasing
        [0],
        np.random.choice(np.arange(1, n_points), n_samples - 1, replace=False)]
    rand_idxs.sort()
    split = np.split(rand_mask, rand_idxs)[1:]
    fs_args = [
        (method_reduceat, (rand_mask, rand_idxs)),
        (method_split_iterate, (rand_mask, rand_idxs)),
        (method_only_iterate, (split,)),
    ]
    n = len(fs_args)
    perm = np.random.permutation(n)
    res = [None] * n
    ts = np.ndarray(3)
    for i in perm:
        f, args = fs_args[i]
        t = -perf_counter()
        r = f(*args)
        t += perf_counter()
        res[i] = r
        ts[i] = t
        np.testing.assert_array_equal(r, res[perm[0]])
    return ts

# Util from https://github.com/eliegoudout/lasvegas/blob/dev/perf/__init__.py
def format_time(duration: float, num_digits: int = 4) -> str: 
    """ Formats a `float` duration in seconds to a human readable `str`.

    A few examples with `num_digits = 4` are given below, showcasing
    some special cases.
    ```
    ╭───────────────┬────────────────┬───────────────────────────────────────╮
    │   Duration    │     Result     │                  Comment              │
    ├───────────────┼────────────────┼───────────────────────────────────────┤
    │      1.5      │    1.500 ss    │ Significant 0's added                 │
    │      0.56789  │    567.9 ms    │ Last digit is rounded...              │
    │      0.99995  │    1.000 ss    │ ...which can lead to precision loss   │
    │      0.12345  │    123.4 ms    │ Rounds half to even (python built-in) │
    │   1234        │    1234. ss    │ Point is added for constant witdh     │
    │  12345        │    12345 ss    │ One more digit for longer durations   │
    │ 123456        │ AssertionError │ Exceeded max duration                 │
    │     -1        │ AssertionError │ Negative duration                     │
    │      0        │    0.000 as    │ Smallest unit for shorter durations   │
    │      5.67e-20 │    0.057 as    │ Precision is worse near 0.            │
    ╰───────────────┴────────────────┴───────────────────────────────────────╯
    ```

    Implementation heavily relies on following facts:
        - Consecutive units have constant ratio of `10 ** 3`,
        - Highest unit is the unit of `duration`'s encoding.

    Arguments:
    ----------
        duration (float): Expressed in seconds, duration to format. Must
            satisfy `0 <= duration < 10 ** (num_digits + 1) - .5`.
        num_digits (int): Number of significant digits to display.
            Larger durations can have one more and shorter durations
            less -- see examples above.

    Returns:
    --------
        (str): Formated duration -- _e.g._ `'567.9 ms'`.

    Raises:
    -------
        `AssertionError` if either `num_digits < 3` or
            `not 0 <= duration < 10 ** (num_digits + 1) - .5`
    """
    units = ['ss', 'ms', 'us', 'ns', 'ps', 'fs', 'as']
    max_pow = 3 * (len(units) - 1)
    n = num_digits
    assert n >= 3
    assert 0 <= duration < 10 ** (n+1) - .5, "Duration out of bounds."
    # Special case 0
    if duration == 0:
        return f"{0:.{n-1}f} " + units[-1]
    # Retrieve left shift for significant part
    left_shift = ceil(- log10(duration)) + n - 1
    significant = round(duration * 10 ** left_shift)
    # Special case `0.0099996` -> `'10.00ms'`
    if significant == 10 ** n:
        significant //= 10
        left_shift -= 1
    # If `duration` is barely too big: remove floating point
    if left_shift == -1:
        return f"{round(duration)} " + units[0]
    # Nominal case
    elif left_shift < max_pow + n:
        unit_index = max(0, 1 + (left_shift - n) // 3)
        y = significant * 10 ** (3 * unit_index - left_shift)
        n_left = int(log10(y) + 1)
        unit = units[unit_index]
        return f"{y:.{max(0, n-n_left)}f}{'.' if n == n_left else ''} " + unit
    # If so small that smallest unit loses precision
    else:
        return f"{duration * 10 ** max_pow:.{n-1}f} " + units[-1]

def expe(n_points, n_samples, n_runs):
    """ Runs multiple times and shows results. """
    T = np.r_[[time_one(n_points, n_samples) for _ in range(n_runs)]]
    T.sort(0)
    min_ = T.min(0)
    max_ = T.max(0)
    mean = T.mean(0)
    p = [5, 25, 50, 75, 95]
    std_p = T[ceil(len(T)*p[1]/100):floor(len(T)*p[-1]/100)+1].std(0)
    percentiles = np.percentile(T, p, 0)
    res = np.c_[min_, max_, mean, std_p, percentiles.T]
    names = ["reduceat", "split + iterate", "iterate only (pre-split)"]
    headers = ['name', 'Min', 'Max', 'Mean', f'Std {p[0]}-{p[-1]}', *map(lambda pp: f"{pp}%", p)]
    table = [[name] + list(map(format_time, line)) 
             for name, line in zip(names, res)]
    colalign = ('left',) + ('center',) * (res.shape[1])
    print(tabulate(table, headers=headers, tablefmt="rounded_outline", colalign=colalign))

n_points = 10000
n_samples = 1000
n_runs = 1000

expe(n_points, n_samples, n_runs)

Results:

>>> # Compute new `start_indices` given mask
>>> n_points = 10000
>>> n_samples = 1000
>>> n_runs = 1000
>>> 
>>> expe(n_points, n_samples, n_runs)
╭──────────────────────────┬──────────┬──────────┬──────────┬────────────┬──────────┬──────────┬──────────┬──────────┬──────────╮
│ name                     │   Min    │   Max    │   Mean   │  Std 5-95  │    5%    │   25%    │   50%    │   75%    │   95%    │
├──────────────────────────┼──────────┼──────────┼──────────┼────────────┼──────────┼──────────┼──────────┼──────────┼──────────┤
│ reduceat                 │ 34.72 us │ 3.478 ms │ 46.86 us │  6.896 us  │ 35.19 us │ 36.67 us │ 37.98 us │ 44.99 us │ 69.08 us │
│ split + iterate          │ 3.950 ms │ 15.35 ms │ 4.543 ms │  425.9 us  │ 4.028 ms │ 4.113 ms │ 4.173 ms │ 4.620 ms │ 6.090 ms │
│ iterate only (pre-split) │ 2.155 ms │ 10.45 ms │ 2.537 ms │  255.1 us  │ 2.203 ms │ 2.276 ms │ 2.317 ms │ 2.507 ms │ 3.335 ms │
╰──────────────────────────┴──────────┴──────────┴──────────┴────────────┴──────────┴──────────┴──────────┴──────────┴──────────╯

eliegoudout · 2023-11-24T13:06:09Z

Regarding empty samples:

I think we should keep them, even if empty.

I was implementing this, but we should decide what happens to sample_range in this case.
We could:

make it + or - np.inf (would cause issues with domain_range if it is deduced from sample_range),
forbid empty samples at instanciation, which wouls mean that empty sample always inherit from a previous sample_range (I think it's a very weird behaviour, but probably the least effort solution)
reconsider this attribute and its interaction with domain_range (for example, saying that -np.inf, +np.inf is ignored by domain_range). This can be done manually of by keeping track of empty samples via yet another (internal) attribute...
?

Any good idea?

vnmabus · 2023-11-24T13:09:40Z

* make it + or - `np.inf` (would cause issues with `domain_range` if it is deduced from `sample_range`),

For me np.nan makes much more sense.

* forbid empty samples at instanciation, which wouls mean that empty sample always inherit from a previous `sample_range` (I think it's a very weird behaviour, but probably the least effort solution)

I would like to allow it if possible.

* reconsider this attribute and its interaction with `domain_range` (for example, saying that `-np.inf, +np.inf` is ignored by `domain_range`). This can be done manually of by keeping track of empty samples via yet another (internal) attribute...

This is ugly. I prefer np.nan, which is more accurate and easier to ignore, I think.

eliegoudout · 2023-11-24T16:52:43Z

Okay, I come with a proposition for the reduceat problem. It was kind of painful because of all the subtle edge cases...

Anyways, here's a proposition:

def _reduceat(
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    ufunc,
    value_empty
):
    """Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the one described in the doc of
    `np.ufunc.reduceat`. Different behaviours are the following:
        - No exception is raised when `indices[i] < 0` or
            `indices[i] >=len(array)`. Instead, the corresponding value
            is `value_empty`.
        - When not in the previous case, the result is `value_empty` if
            `indices[i] >= indices[i+1]` and otherwise, the same as
            `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    array, indices = map(np.asarray, [array, indices])
    axis %= array.ndim
    ax_idx = (slice(None),) * axis
    n = array.shape[axis]

    pad_width = np.full((array.ndim, 2), 0)
    pad_width[axis, 1] = 1
    extended_array = np.pad(array, pad_width, mode="empty")
    extended_indices = np.append(indices, n)

    bad = (indices < 0) | (indices >= n)
    empty = (np.diff(extended_indices) <= 0) | bad
    extended_indices[:-1][bad] = n

    out = ufunc.reduceat(
        extended_array, extended_indices, axis=axis, dtype=dtype, out=out
    )[ax_idx + (slice(-1),)]
    out[ax_idx + (empty,)] = value_empty

    return out

which is used like so:

>>> array = [[0, 1, 2],
...          [0, 2, 1],
...          [1, 0, 2],
...          [1, 2, 0],
...          [2, 0, 1],
...          [2, 1, 0]]
>>> indices = [0, 0, 100, 2, 2, -1, 2, 5, 2]
>>> 
>>> _reduceat(array, indices, dtype=float, ufunc=np.minimum, value_empty=np.nan, axis=0)
array([[nan, nan, nan],
       [ 0.,  0.,  0.],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [nan, nan, nan],
       [ 1.,  0.,  0.],
       [nan, nan, nan],
       [ 1.,  0.,  0.]])
>>> _reduceat(array, indices, dtype=float, ufunc=np.minimum, value_empty=np.nan, axis=-1)
array([[nan,  0., nan, nan, nan, nan,  2., nan,  2.],
       [nan,  0., nan, nan, nan, nan,  1., nan,  1.],
       [nan,  0., nan, nan, nan, nan,  2., nan,  2.],
       [nan,  0., nan, nan, nan, nan,  0., nan,  0.],
       [nan,  0., nan, nan, nan, nan,  1., nan,  1.],
       [nan,  0., nan, nan, nan, nan,  0., nan,  0.]])

I essentially had 2 ideas for implementation: either by extending the array so I acn use index = len(array) for bad values. Or keep the original array, use index = len(array) - 1 and then keep track of those edge cases where I need to reapply ufunc once.
I first tried the second method as it seems a bit more efficient in case of huge arrays, but in the end, I became a bit too ugly for me.

I will next add this to irregular.py and use it whenever useful (lots of places, especially since we allow empty samples now).

Any comment is appreciated.

P.S.: I should point out that dtype is important here and the behaviour might need to be adjusted. For example, if array is only ints and value_empty = np.nan, then an error will occur because it's not convertible. The problem might really arises when the conversion silently works.

eliegoudout · 2023-11-24T17:05:35Z

Also, the following is a problem (from __init__):

        if self.start_indices[-1] >= len(self.points):
            raise ValueError("Index in start_indices out of bounds")

How then to instanciate a fd where tailing samples are empty? I propose one of these solutions:

Either authorize len(self.points) (but not above),
Athorize any integer greater than or equal to len(self.points) and then:
- Either clip to len(self.points),
- Or don't.
Rethink start_indices to maybe have a "pairwise" syntax (meaning that we include the upper index at the end of start_indices.

eliegoudout · 2023-11-26T00:12:55Z

Just for info, I played with a second implementation of _reduceat, simply using reduce multiple times while looping over indices. Sadly, It's pretty wild how differently they behave performance-wise.

Depending on:

array.ndim,
axis,
length of indices,

I can get from 10x faster to 10x slower... So I wouldn't know which to prioritize at the moment without further digging.

def _reduceat2(
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    ufunc,
    value_empty
) -> NDArray:
    """Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the one described in the doc of
    `np.ufunc.reduceat`. Different behaviours are the following:
        - No exception is raised when `indices[i] < 0` or
            `indices[i] >=len(array)`. Instead, the corresponding value
            is `value_empty`.
        - When not in the previous case, the result is `value_empty` if
            `indices[i] >= indices[i+1]` and otherwise, the same as
            `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    if not isinstance(axis, int):
        raise NotImplementedError

    array, indices = map(np.asarray, [array, indices])
    ndim = array.ndim
    assert -ndim <= axis < ndim
    axis %= ndim
    pre, (n,), post = map(tuple, np.split(array.shape, [axis, axis + 1]))
    shape = pre + (len(indices),) + post

    if dtype is None:
        dtype = array.dtype

    if out is None:
        out = np.empty(shape, dtype=dtype)
    else:
        out = out.astype(dtype)

    ii = [slice(None)] * ndim
    for i, (a, b) in enumerate(itertools.pairwise(np.append(indices, n))):
        ii[axis] = i
        ii_out = tuple(ii)
        if a < 0 or a >= min(b, n):  # Nothing to reduce
            out[ii_out] = value_empty
        else:
            ii[axis] = slice(a, b)
            ii_array = tuple(ii)
            out[ii_out] = ufunc.reduce(array[ii_array], axis=axis)
    return out

…points) as start_index for empty sample

vnmabus · 2023-11-30T07:13:06Z

Any comment is appreciated.

I wonder how that compares with my own try (that only attempts to work when the indices are non-decreasing):

def _reduceat_vnmabus(
    ufunc,
    array: ArrayLike,
    indices: ArrayLike,
    axis: int = 0,
    dtype=None,
    out=None,
    *,
    value_empty
):
    """
    Wrapped `np.ufunc.reduceat` to manage edge cases.

    The edge cases are the one described in the doc of
    `np.ufunc.reduceat`. Different behaviours are the following:
        - No exception is raised when `indices[i] < 0` or
            `indices[i] >=len(array)`. Instead, the corresponding value
            is `value_empty`.
        - When not in the previous case, the result is `value_empty` if
            `indices[i] >= indices[i+1]` and otherwise, the same as
            `ufunc.reduce(array[indices[i]:indices[i+1]])`.
    """
    array = np.asarray(array)
    indices = np.asarray(indices)

    n = array.shape[axis]
    good_axis_idx = (indices >= 0) & (indices < n) & (np.diff(indices, append=n) > 0)

    n_out = len(indices)
    out_shape = list(array.shape)
    out_shape[axis] = n_out
    out = np.full_like(array, value_empty, shape=out_shape)

    good_idx = [slice(None)] * array.ndim
    good_idx[axis] = good_axis_idx
    good_idx = tuple(good_idx)

    reduce_at_out = ufunc.reduceat(
        array,
        indices[good_axis_idx],
        axis=axis,
        dtype=dtype,
    )
    out[good_idx] = reduce_at_out

A few (small) comments:

I think that the first argument should be the ufunc.
Using map hurts readability here, I think. Moreover, I also think that we can assume that this function will only be used with ndarray inputs, so we could remove the conversion step.
I am still a bit unsure on why did you need to extend the arrays.
The second version iterates over the functions using a Python for loop. I would say that is definitely going to be slower when the number of functions is high.

vnmabus · 2023-11-30T07:54:18Z

* I chose to allow `start_index = len(points)` to instanciate empty samples,

I agree.

* I temporarily included both implementations of `_reduceat` with hardocded `MODE = 1` variable (`2` also possible). Later benchmarks might decide what we do. It is a possibility that depending on the dimension of the arguments, we prefer mode `1` over `2` or vice versa, at run time

I do not want to overcomplicate things. I would rather have just one version, the one that performs better with lots of data (when performance really matters). This is likely version 1 or my own implementation (I did not benchmark it but should perform similarly). Version 2 has a Python loop over the functions, which will be noticeable when the number of functions is high.

* I'm wondering if, after all, the best value for `_sample_range` would not be `(np.inf, -np.inf)` for empty samples? It breaks monoticity, but is coherent with the idea than `min(empty_set) = np.inf`. Still, atm I chose `np.nan` as previously discussed.

I think that you are probably right here. It even generalizes better to other data types when we want to support them in the future, e.g. using the max and min representable integers.

I would like to raise personal concern regarding the current choice of using common points for mean and var. I thought the natural use case for FDataIrregular would be to have multiple recordings at slightly different time stamps because of natural uncertainty in the timing process. Then, the best solution, imo, would simply be to interpolate on a grid to determine, and then compute the operation. With the current implementation, one would probably get literally zero common point if they used real-world measures. Furthermore, the common_points with an empty set would also be an empty set. While an interpolated version of an empty set on a common grid would simple be all NaN's.

TLDR: Is the idea to work on common points possible to question? I think a more natural approach is to find an appropriate commone grid (equally spaced or not, to discuss...) and then to interpolate.

I agree with you that in practice the mean and variance computed this way would be empty. And it indeed concerns me too. However, I would argue that finding a common grid and interpolating is not something that should be automated in mean and var, unless there is a unique or best way to do it. Otherwise, we prefer the package to be "dumb" and force the user to make a conscious choice.

We have implemented a behavior similar to the one found in Tidyfun (which may not be a sensible choice). Our statistician has told us that usually in FDA the data are not kept in an irregular representation for much time, converting them to a basis representation as soon as possible, and performing the computations (such as mean and variance) in that representation. Thus, probably these methods will not be used much. We have currently a member of the team researching the best way to implement the conversion to a basis representation described in https://academic.oup.com/edited-volume/42134/chapter-abstract/356192087 , in which a mixed effects model is used to leverage information from all observations in the functional dataset in order to convert each one (so that the conversion makes sense even for very sparse data), instead of converting each observation individually.

eliegoudout · 2023-11-30T07:54:50Z

Just a quick partial answer until I can get around a computer.

I am still a bit unsure on why did you need to extend the arrays.

If you're referring to MODE == 1: because I want to use native reduceat on slices i:len(array), and that natively raises exception. Of course, if you only work with non-decreasing indexes, it removes the problem, but the goal of my fix is to encompass all cases mentioned in the doc string.

I would say that is definitely going to be slower when the number of functions is high.

I agree, but the question becomes "when is it tangible?", which is deeply related to "how much data points do we expect for each sample for the average case?", which is rather subjective.

I will incorporate your proposal and propose a few bench results soon.

In the mean time, I think the primary focus should be, for the package as a whole in relation with #585 : let's identify covering use case scenarios.

What I mean by that is that we have the following (most important) degrees of freedom:

dim
codim
n_points
n_samples

And I think we should partition the possibilities in a meaningful way.
For example, I propose something like (completely fake proposition for demonstration purpose, not a real proposition):

scenario_0: n_points * (dim + codim) < 1000
scenario_1: dim == 1 and n_point < 100 * n_samples
scenario_2: dim == 1 and (not scenario_1
scenario 3: everything else.

I think this would allow for:

better benchmarking in general
better identification of the use cases,
possibly dynamically identifying the scenario to use the right implementations.

I really think this is one brick of what #585 requires.

eliegoudout · 2023-11-30T08:13:48Z

Our statistician has told us that usually in FDA the data are not kept in an irregular representation for much time, converting them to a basis representation as soon as possible

Hmmm, then I would ask the following: what is the intended purpose of FDataIrregular? If this is just a proxy for constructing Grid or basis representation, then I would assume this choice entirely and not implement at all processing methods. What do you think?

vnmabus · 2023-12-01T08:56:34Z

Our statistician has told us that usually in FDA the data are not kept in an irregular representation for much time, converting them to a basis representation as soon as possible

Hmmm, then I would ask the following: what is the intended purpose of FDataIrregular? If this is just a proxy for constructing Grid or basis representation, then I would assume this choice entirely and not implement at all processing methods. What do you think?

Well, I think there is enough functionality that is meaningful for FDataIrregular (plotting, evaluation, etc) to justify it being a FData subclass. But yeah, reduction operations, such as the mean are a big problem. I nevertheless suspect that there MAY be a better way of computing the them taking the correlation between points into account (but without doing smoothing). However, I currently have no time to dedicate to such a derivation. If we discover some way to compute them better, that does not depend in heuristics or smoothing parameters, we could try to change it.

vnmabus · 2024-01-31T10:52:11Z

Sorry for the delay, but I was on vacation + job search so I did not have time. As far as I remember, this is only pending the choice for the _reduceat algorithm, backed by benchmarks if possible. I would like to finish this set of changes as soon as possible, to unlock and merge PR #536.

eliegoudout · 2024-01-31T22:25:36Z

Hi :)

No problem, I'm sorry I also got quite busy and left this PR stale! I think that if you want to merge this quickly then we'd better not think too much about the test scenarios I suggested and simply go with your approach. I bet that handling exclusively non-decreasing arrays makes the implementation _reduceat_vnmabus much more efficient than the ones I proposed, no need to double check that imo.

As for the rest:

Do we leave the sample range for empty samples with NaN's? We agreed that (np.inf, -np.inf) would make a lot of sense, but I'm not confident enough that it wouldn't brake other things somewhere in your package (mainly because it's not non-decreasing anymore).
I think there's still a problem with __getitem__ as I wrote it in the opening message. Can we leave this for later?
I think I will not review the remaining parts of the data structure for now, mainly because as discussed, I'm uncomfortable with the current approach to mean and var for example.

If we agree on this, I can do a quick final push with your reduceat and we should be good to go!

Cheers!

eliegoudout · 2024-02-02T14:15:44Z

Hi,

So I tested _reduceat_vnmabus after fixing the handing of dtype and out. I ran experiments with np.add on a boolean mask with len(mask)=1_000 and respectively len(indices)=30, 300, 3000 and it seemed consistently 30% faster. Since I think handling decreasing indices is not useful anyways, I chose it for this PR.

I undrafted the PR, ready for review/merge @vnmabus :)

vnmabus

Just a couple of minor things before merging.

skfda/representation/irregular.py

…erations

skfda/representation/irregular.py

…egoudout/scikit-fda-irregular into feature/irregular_operations

…erations

Automatic casting to float dtype is removed, as it prevents using different float sizes (or even integers in the future).

vnmabus

As we dropped support for Python 3.9, tests now pass. So, I will merge it with the other branch, and if everything goes smoothly, I will merge the other with develop.

Thank @eliegoudout for all the review and programming effort that you have put here!

vnmabus · 2024-03-11T11:29:41Z

@all-contributors please add @eliegoudout for code, review and ideas.

allcontributors · 2024-03-11T11:29:51Z

@vnmabus

I've put up a pull request to add @eliegoudout! 🎉

eliegoudout added 3 commits November 14, 2023 17:12

use ary.ndim instead of len(ary.shape)

0f90db2

points_split and values_split as properties

26b5f0c

FDataIrregular.cleaned restrict method

fb69594

cleaner concatenate

ad71339

eliegoudout added 6 commits November 23, 2023 19:20

FDataIrregular.__init__: validate start_indices

f6a87ad

FDataIrregular.round clean (why start_indices special treatment?)

eee51ad

minor clean

3bd25fa

FDataIrregular._to_data_matrix clean remove loops

ea7e6dd

revert: remove *_split properties

6e32109

restrict keep empty samples

5b0ba71

eliegoudout added 2 commits November 24, 2023 18:11

_reduceat v0

b107525

_get_sample_range_from_data update

0e9949f

eliegoudout added 3 commits November 29, 2023 10:37

Two-modes _reduceat for later decision

6a8a90d

handle nan for domain range compute + enforce float type + allow len(…

22aa8c7

…points) as start_index for empty sample

clean _sort_by_arguments

11508c1

_reduceat wrapper + minor mods

5a841ed

eliegoudout marked this pull request as ready for review February 2, 2024 14:15

vnmabus requested changes Feb 10, 2024

View reviewed changes

eliegoudout and others added 6 commits February 14, 2024 06:03

removed useless op (???)

1a1835e

cleaner _sort_by_arguments from vnmabus

f0fe0d7

resolve reviews

4a2fc88

better lexsort comment

01fe0ba

Merge branch 'develop' into pr-593

3856e10

Merge branch 'feature/irregular_operations' into feature/irregular_op…

d06e843

…erations

vnmabus reviewed Feb 24, 2024

View reviewed changes

skfda/representation/irregular.py Outdated Show resolved Hide resolved

eliegoudout and others added 6 commits February 26, 2024 15:06

fixed typo domain_range max

ee89c2c

fixed restrict + allow domain_range broadcast

6bf925c

Merge branch 'feature/irregular_operations' of https://github.com/eli…

16c8109

…egoudout/scikit-fda-irregular into feature/irregular_operations

Merge branch 'feature/irregular_operations' into feature/irregular_op…

777bd5d

…erations

Fix doctests.

fb6502f

Automatic casting to float dtype is removed, as it prevents using different float sizes (or even integers in the future).

Fix typo.

cd7e73e

vnmabus approved these changes Mar 11, 2024

View reviewed changes

vnmabus merged commit 21f7bad into GAA-UAM:feature/irregular_operations Mar 11, 2024
9 of 10 checks passed

allcontributors bot mentioned this pull request Mar 11, 2024

add eliegoudout as a contributor for code, review, and ideas #605

Merged

vnmabus mentioned this pull request Mar 11, 2024

Feature/irregular_structure_and_operations #536

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`FDataIrregular` personal proposed reviews #593

`FDataIrregular` personal proposed reviews #593

eliegoudout commented Nov 14, 2023 •

edited

Loading

codecov bot commented Nov 14, 2023 •

edited

Loading

vnmabus commented Nov 16, 2023

eliegoudout commented Nov 17, 2023 •

edited

Loading

eliegoudout commented Nov 17, 2023 •

edited

Loading

eliegoudout commented Nov 18, 2023 •

edited

Loading

vnmabus commented Nov 18, 2023

vnmabus commented Nov 18, 2023

vnmabus commented Nov 18, 2023

eliegoudout commented Nov 19, 2023

eliegoudout commented Nov 19, 2023

eliegoudout commented Nov 24, 2023

vnmabus commented Nov 24, 2023

eliegoudout commented Nov 24, 2023 •

edited

Loading

eliegoudout commented Nov 24, 2023

eliegoudout commented Nov 26, 2023

vnmabus commented Nov 30, 2023

vnmabus commented Nov 30, 2023

eliegoudout commented Nov 30, 2023 •

edited

Loading

eliegoudout commented Nov 30, 2023

vnmabus commented Dec 1, 2023

vnmabus commented Jan 31, 2024

eliegoudout commented Jan 31, 2024

eliegoudout commented Feb 2, 2024

vnmabus left a comment

vnmabus left a comment

vnmabus commented Mar 11, 2024

allcontributors bot commented Mar 11, 2024

FDataIrregular personal proposed reviews #593

FDataIrregular personal proposed reviews #593

Conversation

eliegoudout commented Nov 14, 2023 • edited Loading

codecov bot commented Nov 14, 2023 • edited Loading

Codecov Report

vnmabus commented Nov 16, 2023

eliegoudout commented Nov 17, 2023 • edited Loading

eliegoudout commented Nov 17, 2023 • edited Loading

eliegoudout commented Nov 18, 2023 • edited Loading

vnmabus commented Nov 18, 2023

vnmabus commented Nov 18, 2023

vnmabus commented Nov 18, 2023

eliegoudout commented Nov 19, 2023

eliegoudout commented Nov 19, 2023

eliegoudout commented Nov 24, 2023

vnmabus commented Nov 24, 2023

eliegoudout commented Nov 24, 2023 • edited Loading

eliegoudout commented Nov 24, 2023

eliegoudout commented Nov 26, 2023

vnmabus commented Nov 30, 2023

vnmabus commented Nov 30, 2023

eliegoudout commented Nov 30, 2023 • edited Loading

eliegoudout commented Nov 30, 2023

vnmabus commented Dec 1, 2023

vnmabus commented Jan 31, 2024

eliegoudout commented Jan 31, 2024

eliegoudout commented Feb 2, 2024

vnmabus left a comment

Choose a reason for hiding this comment

vnmabus left a comment

Choose a reason for hiding this comment

vnmabus commented Mar 11, 2024

allcontributors bot commented Mar 11, 2024

`FDataIrregular` personal proposed reviews #593

`FDataIrregular` personal proposed reviews #593

eliegoudout commented Nov 14, 2023 •

edited

Loading

codecov bot commented Nov 14, 2023 •

edited

Loading

eliegoudout commented Nov 17, 2023 •

edited

Loading

eliegoudout commented Nov 17, 2023 •

edited

Loading

eliegoudout commented Nov 18, 2023 •

edited

Loading

eliegoudout commented Nov 24, 2023 •

edited

Loading

eliegoudout commented Nov 30, 2023 •

edited

Loading